Status: Seedling
December 2021

Euclidean Distance and Pearson Correlation

I recently learned there’s a useful connection between Euclidean distance and Pearson correlation: the squared Euclidean distance between two z-normalized vectors equals their Pearson distance, up to a scale factor.

This short note walks through a derivation.

Z-normalization

Let’s say we have two z-normalized vectors xx and yy in NN dimensions.

Z-normalizing a vector involves subtracting its mean then dividing by its standard deviation, so we know that the means μx\mu_x and μy\mu_y of our vectors xx and yy are zero:

μx=μy=0 \mu_x = \mu_y = 0

and their standard deviations σx\sigma_x and σy\sigma_y are one:

σx=σy=1 \sigma_x = \sigma_y = 1

We’re going to introduce some statistical concepts and see how they simplify under these constraints. Expressions in color are simplifications valid only under z-normalization.

Summations implicity sum over all terms: xi=iNxi\small{\sum{x_i} = \sum_i^Nx_i}

Covariance

Since the means μx\mu_x and μy\mu_y are both zero, the numerator reduces to a dot product:

cov(x,y)=(xiμx)(yiμy)N \operatorname{cov}(x, y) = {\sum{(x_i - \mu_{x})(y_i - \mu_{y})} \over N}

cov(x,y)=xiyiN \htmlClass{special}{\operatorname{cov}(x, y) = {{\sum{x_iy_i}} \over N}}

Standard deviation

The standard deviations σx\sigma_x and σy\sigma_y both simplify to one by the definition of z-normalization:

σx=cov(x,x) \sigma_x = \sqrt{cov(x, x)}

σx=xi2N=1 \htmlClass{special}{\sigma_x = \sqrt{\sum{x_i^2} \over N} = 1}

Pearson correlation

The denominator reduces to one, leaving us with just the covariance:

P(x,y)=cov(x,y)σxσy \operatorname{P}(x, y) = {\operatorname{cov}(x, y) \over \sigma_x\sigma_y}

P(x,y)=cov(x,y)=xiyiN \htmlClass{special}{\operatorname{P}(x, y) = \operatorname{cov}(x, y) = {{\sum{x_iy_i}} \over N}}

Euclidean distance

Now let’s look at squared Euclidean distance between z-normalized vectors.

dist2(x,y)=(xiyi)2 \operatorname{dist^2}(x, y) = \sum{(x_i - y_i)^2}

We can expand out the squared differences inside the sum:

dist2(x,y)=(xi22xiyi+yi2) {\operatorname{dist^2}(x,y) =}\sum{(x_i^2 - 2x_iy_i + y_i^2)} then separate out the terms into their own sums so we can tackle them one by one:

dist2(x,y)=xi22xiyi+yi2 {\operatorname{dist^2}(x,y) =} \sum{x_i^2} - 2\sum{x_iy_i} + \sum{y_i^2}

Let’s examine this. The two sum-of-squares terms, xi2\sum{x_i^2} and yi2\sum{y_i^2}, are both NN times the squared standard deviation, which reduces to NN:

xi2N times σx2 = N2xiyi+yi2N times σy2 = N \underbrace{\sum{x_i^2}}_{N \text{ times } \sigma^2_x ~=~ N} \htmlClass{light}{ - 2\sum{x_iy_i} +} \underbrace{\sum{y_i^2}}_{N \text{ times } \sigma^2_y ~=~ N}

Meanwhile, the middle term is 2N2N times P(x,y)\operatorname{P}(x, y):

xi22xiyi2N times P(x,y)+yi2 \htmlClass{light}{\sum{x_i^2} -} \underbrace{2\sum{x_iy_i}}_{2N \text { times } \operatorname{P}(x, y)} \htmlClass{light}{+ \sum{y_i^2}} Putting these pieces together, our formula becomes:

dist2(x,y)=2N2NP(x,y) \htmlClass{special}{\operatorname{dist^2}(x, y) = 2N - 2N \cdot \operatorname{P}(x, y)}

which can be simplified to:

dist2(x,y)=2N(1P(x,y)) \htmlClass{special}{\operatorname{dist^2}(x, y) = 2N(1 - \operatorname{P}(x, y))}

showing that the squared Euclidean distance between two z-normalized vectors is 2N2N times their Pearson distance.

Why does this matter?

A common trick to improve Euclidean nearest-neighbor searches is to z-normalize the vectors that you’re searching over. The connection to Pearson correlation gives us an intuition for why this works so well in practice.

Additionally, representing correlation as a distance metric allows us to use it with algorithms that rely on properties of a distance metric, such as the triangle inequality. Despite the name, Pearson distance is not actually a distance.

References

On Clustering Time Series Using Euclidean Distance and Pearson Correlation:

For time series comparisons, it has often been observed that z-score normalized Euclidean distances far outperform the unnormalized variant. In this paper we show that a z-score normalized, squared Euclidean Distance is, in fact, equal to a distance based on Pearson Correlation.

↑ Back to top