Euclidean Distance and Pearson Correlation

Note: This was originally published as an Observable notebook.

I recently learned there's a useful connection between Euclidean distance and Pearson correlation: the squared Euclidean distance between two z-normalized vectors equals their Pearson distance, up to a scale factor.

This short note walks through a derivation.

Z-normalization

Let's say we have two z-normalized vectors $x$ and $y$ in $n$ dimensions.

Z-normalizing a vector involves subtracting its mean and then dividing by its standard deviation, so we know that the means $\mu_x$ and $\mu_y$ of our vectors $x$ and $y$ are zero:
$$\mu_x = \mu_y = 0$$

and their standard deviations $\sigma_x$ and $\sigma_y$ are one:
$$\sigma_x = \sigma_y = 1$$
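
To make this concrete, here is a minimal NumPy sketch of z-normalization; the helper name `znorm` and the example vectors are my own, not from the original notebook:

```python
import numpy as np

def znorm(v):
    # Z-normalize: subtract the mean, then divide by the (population) standard deviation.
    return (v - v.mean()) / v.std()

x = znorm(np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
y = znorm(np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0, 8.0]))

print(x.mean(), x.std())  # approximately 0.0 and 1.0, as claimed above
print(y.mean(), y.std())  # approximately 0.0 and 1.0
```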

We're going to introduce some statistical concepts and see how they simplify under these constraints; the simplified forms below are valid only for z-normalized vectors.

Summations implicitly run over all $n$ terms:
$$\sum x_i \equiv \sum_{i=1}^{n} x_i$$

Covariance

Covariance is the average product of the two vectors' deviations from their means:
$$\operatorname{cov}(x, y) = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{n}$$
Since the means $\mu_x$ and $\mu_y$ are both zero, the numerator reduces to a dot product:
$$\operatorname{cov}(x, y) = \frac{\sum x_i y_i}{n} = \frac{x \cdot y}{n}$$
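
A quick numerical sanity check of this simplification (the example vectors here are assumptions of mine, not data from the note):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0, 8.0])
x = (x - x.mean()) / x.std()  # z-normalize
y = (y - y.mean()) / y.std()
n = len(x)

# Population covariance (divisor n) equals the dot product over n
# once both means are zero.
print(np.cov(x, y, bias=True)[0, 1])
print(x.dot(y) / n)
```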

Standard deviation

The standard deviations $\sigma_x$ and $\sigma_y$ both simplify to one by the definition of z-normalization:
$$\sigma_x = \sqrt{\frac{\sum (x_i - \mu_x)^2}{n}} = 1, \qquad \sigma_y = \sqrt{\frac{\sum (y_i - \mu_y)^2}{n}} = 1$$

Pearson correlation

Pearson correlation is the covariance divided by the product of the standard deviations:
$$r(x, y) = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y}$$
The denominator reduces to one, leaving us with just the covariance:
$$r(x, y) = \operatorname{cov}(x, y) = \frac{x \cdot y}{n}$$
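
The same check works for correlation; again the example vectors are my own:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0, 8.0])
x = (x - x.mean()) / x.std()  # z-normalize
y = (y - y.mean()) / y.std()

# For z-normalized vectors, Pearson correlation is just the dot product over n.
print(np.corrcoef(x, y)[0, 1])
print(x.dot(y) / len(x))
```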

Euclidean distance

Now let's look at the squared Euclidean distance between z-normalized vectors:
$$d^2(x, y) = \sum (x_i - y_i)^2$$

We can expand out the squared differences inside the sum:
$$d^2(x, y) = \sum \left( x_i^2 - 2 x_i y_i + y_i^2 \right)$$

then separate the terms into their own sums so we can tackle them one by one:
$$d^2(x, y) = \sum x_i^2 - 2 \sum x_i y_i + \sum y_i^2$$

Let's examine this. The two sum-of-squares terms, $\sum x_i^2$ and $\sum y_i^2$, are each $n$ times the corresponding squared standard deviation, which reduces to $n$:
$$\sum x_i^2 = n \sigma_x^2 = n, \qquad \sum y_i^2 = n \sigma_y^2 = n$$

Meanwhile, the sum in the middle term is $n$ times the Pearson correlation $r(x, y)$:
$$\sum x_i y_i = n \, r(x, y)$$

Putting these pieces together, our formula becomes:
$$d^2(x, y) = n - 2 n \, r(x, y) + n$$

which can be simplified to:
$$d^2(x, y) = 2 n \left( 1 - r(x, y) \right)$$

showing that the squared Euclidean distance between two z-normalized vectors is $2n$ times their Pearson distance, $1 - r(x, y)$.
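
The identity is easy to verify numerically; the following sketch uses random vectors of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = rng.normal(size=n)
x = (x - x.mean()) / x.std()  # z-normalize both vectors
y = (y - y.mean()) / y.std()

d2 = np.sum((x - y) ** 2)      # squared Euclidean distance
r = np.corrcoef(x, y)[0, 1]    # Pearson correlation
print(d2, 2 * n * (1 - r))     # the two quantities agree
```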

Why does this matter?

A common trick to improve Euclidean nearest-neighbor searches is to z-normalize the vectors that you're searching over. The connection to Pearson correlation gives us an intuition for why this works so well in practice: after z-normalization, ranking neighbors by Euclidean distance is the same as ranking them by correlation.
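
Here is a hedged sketch of that equivalence, using random data and a `znorm` helper of my own: nearest-neighbor order under squared Euclidean distance matches the order under correlation.

```python
import numpy as np

def znorm(v):
    # Row-wise z-normalization.
    return (v - v.mean(axis=-1, keepdims=True)) / v.std(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
database = znorm(rng.normal(size=(1000, 32)))  # vectors we search over
query = znorm(rng.normal(size=32))

# Ranking by squared Euclidean distance to the query...
dist_order = np.argsort(np.sum((database - query) ** 2, axis=1))
# ...matches ranking by Pearson correlation, highest correlation first.
corr_order = np.argsort(-(database @ query) / query.size)
print(np.array_equal(dist_order, corr_order))  # True (barring exact ties)
```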

Additionally, despite the name, Pearson distance is not actually a distance metric. Expressing correlation as a Euclidean distance between z-normalized vectors, which is a true metric, lets us use it with algorithms that rely on metric properties such as the triangle inequality.

References

"For time series comparisons, it has often been observed that z-score normalized Euclidean distances far outperform the unnormalized variant. In this paper we show that a z-score normalized, squared Euclidean Distance is, in fact, equal to a distance based on Pearson Correlation."

Thanks to Dillon Shook for comments on an early draft of this note.