python scikit-learn pairwise-distance

What does sklearn's pairwise_distances with metric='correlation' do?


I've passed different values into this function and observed the output, but I can't find a predictable pattern in what it returns.

Then I tried digging through the function itself, but it's confusing because it can perform a number of different calculations.

According to the Docs:

Compute the distance matrix from a vector array X and optional Y.

I see it returns a square matrix whose height and width equal the number of nested lists passed in, implying that it compares every pair of them.

But otherwise I'm having a tough time understanding what it's doing and where the values are coming from.

Examples I've tried:

from sklearn.metrics import pairwise_distances

pairwise_distances([[1]], metric='correlation')
>>> array([[0.]])

pairwise_distances([[1], [1]], metric='correlation')
>>> array([[ 0., nan],
>>>        [nan,  0.]])

# returns same as last input although input values differ
pairwise_distances([[1], [2]], metric='correlation')
>>> array([[ 0., nan],
>>>        [nan,  0.]])

pairwise_distances([[1,2], [1,2]], metric='correlation')
>>> array([[0.00000000e+00, 2.22044605e-16],
>>>        [2.22044605e-16, 0.00000000e+00]])

# returns same as last input although input values differ
# I incorrectly expected more distance because input values differ more
pairwise_distances([[1,2], [1,3]], metric='correlation')
>>> array([[0.00000000e+00, 2.22044605e-16],
>>>        [2.22044605e-16, 0.00000000e+00]])

Computing correlation distance with Scipy

I don't understand where the sklearn 2.22044605e-16 value is coming from if scipy returns 0.0 for the same inputs.

# Scipy
import scipy.spatial.distance
scipy.spatial.distance.correlation([1,2], [1,2])
>>> 0.0

# Sklearn
pairwise_distances([[1,2], [1,2]], metric='correlation')
>>> array([[0.00000000e+00, 2.22044605e-16],
>>>        [2.22044605e-16, 0.00000000e+00]])

I'm not looking for a high level explanation but an example of how the numbers are calculated.


Solution

  • pairwise_distances internally calls scipy's distance.pdist() when Y is None (that is, when we want the distance between every pair of row vectors in X).

    Reference 1, 2

    The implementation would be similar to the following:

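    import numpy as np
    from numpy.linalg import norm

    X = np.array([[1, 2], [1, 2]])

    # Center each row by subtracting its own mean
    X2 = X - X.mean(axis=1, keepdims=True)

    u, v = X2

    # 1 minus the cosine similarity of the centered rows
    1 - (np.sum(u * v) / (norm(u) * norm(v)))

    # 2.220446049250313e-16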
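The same centered-cosine formula also seems to account for the other outputs in the question. The sketch below is a pure-Python rendering of it, not scikit-learn's or SciPy's actual code; the helper name correlation_distance is made up for illustration. A row with a single feature centers to all zeros, so the ratio becomes 0/0, which NumPy reports as nan. Rows such as [1, 2] and [1, 3] both simply increase, so after centering they point in the same direction and the distance is about 0 no matter how far apart the raw values are.

```python
import math

def correlation_distance(u, v):
    """Hypothetical helper: 1 - cosine similarity of the mean-centered vectors."""
    # Center each vector by subtracting its own mean.
    u_mean = sum(u) / len(u)
    v_mean = sum(v) / len(v)
    uc = [a - u_mean for a in u]
    vc = [b - v_mean for b in v]
    # Dot product and product of the Euclidean norms of the centered vectors.
    dot = sum(a * b for a, b in zip(uc, vc))
    denom = math.sqrt(sum(a * a for a in uc)) * math.sqrt(sum(b * b for b in vc))
    # A constant vector (including any 1-element vector) centers to all zeros,
    # so the denominator is 0. NumPy evaluates 0/0 as nan; plain Python would
    # raise ZeroDivisionError, so nan is returned explicitly here.
    if denom == 0.0:
        return float("nan")
    return 1.0 - dot / denom

d_single = correlation_distance([1], [2])          # nan: one feature per row
d_same_trend = correlation_distance([1, 2], [1, 3])  # approximately 0
d_opposite = correlation_distance([1, 2], [2, 1])    # approximately 2
print(d_single, d_same_trend, d_opposite)
```

Correlation distance therefore measures whether two rows vary in the same pattern, not how far apart their values are; it ranges from roughly 0 (perfectly correlated) to roughly 2 (perfectly anti-correlated).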
    

    However, scipy.spatial.distance.correlation is implemented differently in the latest version:

    latest version, old version

    With the weights set to None, it simplifies to the following snippet:

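    import numpy as np

    u, v = np.array([1, 2]), np.array([1, 2])

    # Center both vectors
    umu = np.average(u)
    vmu = np.average(v)
    u = u - umu
    v = v - vmu

    # Averages instead of sums and norms; the 1/n factors cancel in the ratio
    uv = np.average(u * v)
    uu = np.average(np.square(u))
    vv = np.average(np.square(v))
    dist = 1.0 - uv / np.sqrt(uu * vv)
    dist

    # 0.0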
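The two formulas are mathematically identical, so the gap between 2.22e-16 and 0.0 comes down to floating-point rounding in the denominator. The sketch below reproduces only that arithmetic for the centered vector [-0.5, 0.5], using the standard library instead of NumPy; the variable names are illustrative and it is not taken from either library's source.

```python
import math

# Centered form of [1, 2]; both inputs are identical, so u == v.
u = [-0.5, 0.5]
dot = sum(a * a for a in u)          # 0.25 + 0.25 = 0.5, exact in binary

# Norm-based denominator: sqrt(0.5) has to be rounded, and multiplying the
# rounded value by itself lands slightly above 0.5.
denom_norm = math.sqrt(dot) * math.sqrt(dot)
dist_norm = 1.0 - dot / denom_norm

# Average-based denominator: 0.25 * 0.25 = 0.0625 and sqrt(0.0625) = 0.25
# are all exact, so the ratio is exactly 1.
uv = dot / len(u)
denom_avg = math.sqrt(uv * uv)
dist_avg = 1.0 - uv / denom_avg

print(denom_norm, dist_norm)
print(denom_avg, dist_avg)
```

The value 2.220446049250313e-16 is the float64 machine epsilon (2**-52), so the sklearn result is best read as "zero, up to one rounding step" and not as a meaningful distance.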