December 2021

Euclidean Distance and Pearson Correlation

I recently learned there’s a useful connection between Euclidean distance and Pearson correlation: the squared Euclidean distance between two z-normalized vectors equals their Pearson distance, up to a scale factor.

This short note walks through a derivation.

Z-normalization

Let’s say we have two z-normalized vectors $x$ and $y$ in $N$ dimensions.

Z-normalizing a vector involves subtracting its mean then dividing by its standard deviation, so we know that the means $\mu_x$ and $\mu_y$ of our vectors $x$ and $y$ are zero:

$\mu_x = \mu_y = 0$

and their standard deviations $\sigma_x$ and $\sigma_y$ are one:

$\sigma_x = \sigma_y = 1$

We’re going to introduce some statistical concepts and see how they simplify under these constraints. Expressions in color are simplifications valid only under z-normalization.

Summations implicity sum over all terms: $\small{\sum{x_i} = \sum_i^Nx_i}$

Covariance

Since the means $\mu_x$ and $\mu_y$ are both zero, the numerator reduces to a dot product:

$\operatorname{cov}(x, y) = {\sum{(x_i - \mu_{x})(y_i - \mu_{y})} \over N}$

$\htmlClass{special}{\operatorname{cov}(x, y) = {{\sum{x_iy_i}} \over N}}$

Standard deviation

The standard deviations $\sigma_x$ and $\sigma_y$ both simplify to one by the definition of z-normalization:

$\sigma_x = \sqrt{cov(x, x)}$

$\htmlClass{special}{\sigma_x = \sqrt{\sum{x_i^2} \over N} = 1}$

Pearson correlation

The denominator reduces to one, leaving us with just the covariance:

$\operatorname{P}(x, y) = {\operatorname{cov}(x, y) \over \sigma_x\sigma_y}$

$\htmlClass{special}{\operatorname{P}(x, y) = \operatorname{cov}(x, y) = {{\sum{x_iy_i}} \over N}}$

Euclidean distance

Now let’s look at squared Euclidean distance between z-normalized vectors.

$\operatorname{dist^2}(x, y) = \sum{(x_i - y_i)^2}$

We can expand out the squared differences inside the sum:

${\operatorname{dist^2}(x,y) =}\sum{(x_i^2 - 2x_iy_i + y_i^2)}$ then separate out the terms into their own sums so we can tackle them one by one:

${\operatorname{dist^2}(x,y) =} \sum{x_i^2} - 2\sum{x_iy_i} + \sum{y_i^2}$

Let’s examine this. The two sum-of-squares terms, $\sum{x_i^2}$ and $\sum{y_i^2}$ , are both $N$ times the squared standard deviation, which reduces to $N$ :

\underbrace{\sum{x_i^2}}_{N \text{ times } \sigma^2_x ~=~ N} \htmlClass{light}{ - 2\sum{x_iy_i} +} \underbrace{\sum{y_i^2}}_{N \text{ times } \sigma^2_y ~=~ N}

Meanwhile, the middle term is $2N$ times $\operatorname{P}(x, y)$ :

$\htmlClass{light}{\sum{x_i^2} -} \underbrace{2\sum{x_iy_i}}_{2N \text { times } \operatorname{P}(x, y)} \htmlClass{light}{+ \sum{y_i^2}}$ Putting these pieces together, our formula becomes:

$\htmlClass{special}{\operatorname{dist^2}(x, y) = 2N - 2N \cdot \operatorname{P}(x, y)}$

which can be simplified to:

$\htmlClass{special}{\operatorname{dist^2}(x, y) = 2N(1 - \operatorname{P}(x, y))}$

showing that the squared Euclidean distance between two z-normalized vectors is $2N$ times their Pearson distance.

Why does this matter?

A common trick to improve Euclidean nearest-neighbor searches is to z-normalize the vectors that you’re searching over. The connection to Pearson correlation gives us an intuition for why this works so well in practice.

Additionally, representing correlation as a distance metric allows us to use it with algorithms that rely on properties of a distance metric, such as the triangle inequality. Despite the name, Pearson distance is not actually a distance.

References

On Clustering Time Series Using Euclidean Distance and Pearson Correlation:

For time series comparisons, it has often been observed that z-score normalized Euclidean distances far outperform the unnormalized variant. In this paper we show that a z-score normalized, squared Euclidean Distance is, in fact, equal to a distance based on Pearson Correlation.