Euclidean Distance and Pearson Correlation
Note: This was originally published as an Observable notebook.
I recently learned there's a useful connection between Euclidean distance and Pearson correlation: the squared Euclidean distance between two z-normalized vectors equals their Pearson distance, up to a scale factor.
This short note walks through a derivation.
Z-normalization
Let's say we have two z-normalized vectors
Z-normalizing a vector involves subtracting its mean then dividing by its standard deviation, so we know that the means
and their standard deviations
We're going to introduce some statistical concepts and see how they simplify under these constraints. Expressions in color are simplifications valid only under z-normalization.
Summations implicity sum over all terms:
Covariance
Since the means
Standard deviation
The standard deviations
Pearson correlation
The denominator reduces to one, leaving us with just the covariance:
Euclidean distance
Now let's look at squared Euclidean distance between z-normalized vectors.
We can expand out the squared differences inside the sum:`
then separate out the terms into their own sums so we can tackle them one by one:`
Let's examine this. The two sum-of-squares terms,
Meanwhile, the middle term is
Putting these pieces together, our formula becomes:`
which can be simplified to:`
showing that the squared Euclidean distance between two z-normalized vectors is
Why does this matter?
A common trick to improve Euclidean nearest-neighbor searches is to z-normalize the vectors that you're searching over. The connection to Pearson correlation gives us an intuition for why this works so well in practice.
Additionally, representing correlation as a distance metric allows us to use it with algorithms that rely on properties of a distance metric, such as the triangle inequality. Despite the name, Pearson distance is not actually a distance.
References
For time series comparisons, it has often been observed that z-score normalized Euclidean distances far outperform the unnormalized variant. In this paper we show that a z-score normalized, squared Euclidean Distance is, in fact, equal to a distance based on Pearson Correlation.
Thanks to Dillon Shook for comments on an early draft of this note.