
Why are there discrepancies when generating a distance matrix with scipy pdist(metric = 'jaccard') vs scipy jaccard?


I am comparing the Jaccard distance matrix I get when I process a dataset using pdist and a DIY Jaccard distance matrix function. I'm getting different results in my output distance matrices and I'm not sure why.

I think one of the following is the cause:

The docs for squareform go a bit over my head, so some form of normalisation might be happening there. However, the squareform-ed distance matrix does not preserve the relative distance magnitudes between cells, which is confusing (e.g. row 0 of my DIY distance matrix is 0, 0.571429, 1, while with pdist it is 0, 1, 1, so the middle value is much larger with pdist).

Can anyone explain why I'm getting a different distance matrix when the data is being analysed with the same metric?

My code:

import numpy as np
from scipy.spatial.distance import jaccard, squareform, pdist

def jaccard_dissimilarity(feature_list1, feature_list2, filler_val):  # binary
    # Values equal to filler_val are irrelevant for my use case, so exclude them from the comparison
    all_features = {i for i in feature_list1 if i != filler_val}
    all_features.update(i for i in feature_list2 if i != filler_val)
    # Presence/absence (1/0) of every feature in each list
    counts_1 = [1 if feature in feature_list1 else 0 for feature in all_features]
    counts_2 = [1 if feature in feature_list2 else 0 for feature in all_features]
    return jaccard(counts_1, counts_2)
 
    
data_array = np.array([[1, 2, 3, 4, 5],
                      [3, 4, 5, 6, 7],
                      [8, 9, 10, 11, 12]])

# =============================================================================
# DIY distance matrix
# =============================================================================
# Set filler_val to None so that no values are excluded and the inputs match what pdist sees
dist_diy = np.array([[jaccard_dissimilarity(a,b, None) for a in data_array] for b in data_array])

# =============================================================================
# pdist distance matrix
# =============================================================================
dist_pdist = squareform(pdist(data_array, metric = 'jaccard'))

Input array:

1   2   3   4   5
3   4   5   6   7
8   9   10  11  12

dist_diy:

0           0.571429    1
0.571429    0           1
1           1           0

dist_pdist:

0   1   1
1   0   1
1   1   0

Solution

  • It looks like pdist compares the values at each index position, rather than which values are present anywhere in the array. If I change data_array[1] to 3, 4, 5, 4, 5, the distance matrix changes to reflect the fact that data_array[0][3:5] == data_array[1][3:5]:

    0   0.6 1
    0.6 0   1
    1   1   0
    

    The behaviour is discussed here, but based on the test above the arrays don't have to be boolean: if they were treated as boolean, the distance matrix would not have changed, because every value is non-zero and therefore truthy.

    The DIY function considered the objects present rather than the index at which those objects were found, hence the discrepancy!
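The two behaviours described above can be sketched side by side. This is a minimal illustration, not SciPy's actual implementation: positional_jaccard and set_jaccard_matrix are hypothetical helper names, and the position-wise rule (count the positions where the values differ, divided by the positions where at least one value is non-zero) is an assumption inferred from the outputs shown above. How pdist handles non-boolean input may also differ between SciPy versions, so the positional helper is written in plain NumPy, and pdist is only given boolean indicator rows.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def positional_jaccard(u, v):
    """Position-wise dissimilarity on raw values (hypothetical helper).

    Assumed rule: positions where the values differ, divided by the
    positions where at least one of the two values is non-zero.
    """
    u, v = np.asarray(u), np.asarray(v)
    either_nonzero = (u != 0) | (v != 0)
    differ = (u != v) & either_nonzero
    total = either_nonzero.sum()
    return differ.sum() / total if total else 0.0


def set_jaccard_matrix(rows, filler_val=None):
    """Set-based Jaccard distance matrix (hypothetical helper).

    Each row becomes a boolean indicator vector over every distinct value,
    so pdist compares which values are present, not where they sit.
    """
    rows = np.asarray(rows)
    vocab = np.unique(rows)
    if filler_val is not None:
        vocab = vocab[vocab != filler_val]  # ignore the filler value
    indicators = np.array([np.isin(vocab, row) for row in rows])
    # squareform only reshapes the condensed distances into a square matrix
    return squareform(pdist(indicators, metric="jaccard"))


data_array = np.array([[1, 2, 3, 4, 5],
                       [3, 4, 5, 6, 7],
                       [8, 9, 10, 11, 12]])

print(positional_jaccard(data_array[0], data_array[1]))    # 1.0
print(positional_jaccard(data_array[0], [3, 4, 5, 4, 5]))  # 0.6
print(set_jaccard_matrix(data_array))
```

With the original data_array, set_jaccard_matrix gives 4/7 (about 0.571429) between rows 0 and 1, matching the DIY matrix, while the positional helper gives 1 for the same pair, matching the pdist output in the question.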