Euclidean Distance and Pearson Correlation
I recently learned there’s a useful connection between Euclidean distance and Pearson correlation: the squared Euclidean distance between two z-normalized vectors equals their Pearson distance, up to a scale factor.
This short note walks through a derivation.
Z-normalization
Let’s say we have two z-normalized vectors $x$ and $y$ in $N$ dimensions.
Z-normalizing a vector involves subtracting its mean then dividing by its standard deviation, so we know that the means $\mu_x$ and $\mu_y$ of our vectors $x$ and $y$ are zero:
$$ \mu_x = \mu_y = 0 $$
and their standard deviations $\sigma_x$ and $\sigma_y$ are one:
$$ \sigma_x = \sigma_y = 1 $$
We’re going to introduce some statistical concepts and see how they simplify under these constraints. Expressions in color are simplifications valid only under z-normalization.
Summations implicity sum over all terms: $\small{\sum{x_i} = \sum_i^Nx_i}$
Covariance
Since the means $\mu_x$ and $\mu_y$ are both zero, the numerator reduces to a dot product:
$$ \operatorname{cov}(x, y) = {\sum{(x_i - \mu_{x})(y_i - \mu_{y})} \over N} $$
$$ \htmlClass{special}{\operatorname{cov}(x, y) = {{\sum{x_iy_i}} \over N}} $$
Standard deviation
The standard deviations $\sigma_x$ and $\sigma_y$ both simplify to one by the definition of z-normalization:
$$ \sigma_x = \sqrt{cov(x, x)} $$
$$ \htmlClass{special}{\sigma_x = \sqrt{\sum{x_i^2} \over N} = 1} $$
Pearson correlation
The denominator reduces to one, leaving us with just the covariance:
$$ \operatorname{P}(x, y) = {\operatorname{cov}(x, y) \over \sigma_x\sigma_y} $$
$$ \htmlClass{special}{\operatorname{P}(x, y) = \operatorname{cov}(x, y) = {{\sum{x_iy_i}} \over N}} $$
Euclidean distance
Now let’s look at squared Euclidean distance between z-normalized vectors.
$$ \operatorname{dist^2}(x, y) = \sum{(x_i - y_i)^2} $$
We can expand out the squared differences inside the sum:
$$ {\operatorname{dist^2}(x,y) =}\sum{(x_i^2 - 2x_iy_i + y_i^2)} $$ then separate out the terms into their own sums so we can tackle them one by one:
$$ {\operatorname{dist^2}(x,y) =} \sum{x_i^2} - 2\sum{x_iy_i} + \sum{y_i^2} $$
Let’s examine this. The two sum-of-squares terms, $\sum{x_i^2}$ and $\sum{y_i^2}$, are both $N$ times the squared standard deviation, which reduces to $N$:
Meanwhile, the middle term is $2N$ times $\operatorname{P}(x, y)$:
$$ \htmlClass{light}{\sum{x_i^2} -} \underbrace{2\sum{x_iy_i}}_{2N \text { times } \operatorname{P}(x, y)} \htmlClass{light}{+ \sum{y_i^2}} $$ Putting these pieces together, our formula becomes:
$$ \htmlClass{special}{\operatorname{dist^2}(x, y) = 2N - 2N \cdot \operatorname{P}(x, y)} $$
which can be simplified to:
$$ \htmlClass{special}{\operatorname{dist^2}(x, y) = 2N(1 - \operatorname{P}(x, y))} $$
showing that the squared Euclidean distance between two z-normalized vectors is $2N$ times their Pearson distance.
Why does this matter?
A common trick to improve Euclidean nearest-neighbor searches is to z-normalize the vectors that you’re searching over. The connection to Pearson correlation gives us an intuition for why this works so well in practice.
Additionally, representing correlation as a distance metric allows us to use it with algorithms that rely on properties of a distance metric, such as the triangle inequality. Despite the name, Pearson distance is not actually a distance.
References
On Clustering Time Series Using Euclidean Distance and Pearson Correlation:
For time series comparisons, it has often been observed that z-score normalized Euclidean distances far outperform the unnormalized variant. In this paper we show that a z-score normalized, squared Euclidean Distance is, in fact, equal to a distance based on Pearson Correlation.