i have a dataframe that has a column of lists of string ids. (see below). I want to create a distance matrix between all pairwise "distances" between all the rows (e.g. if 10 rows, then it's a 10x 10 matrix). the rows are lists of ids, so I'm not sure how things like pdist can be used.
the values are string ids. just like string names
ids
0 [58545-19, 462423-43, 277581-25]
1 [0]
2 [454950-82, 433701-46, 228790-63, 266250-52, 458759-98, 152986-78, 222217-39, 433515-16, 265589-83, 439403-23, 277892-38, 223497-19, 224072-83, 461887-57, 436147-12, 227479-78, 228893-32, 279415-18, 439426-27, 437742-46, 438156-73, 438458-68, 277898-05, 438675-76, 454658-95, 431222-77, 462579-94, 434939-86, 222211-09, 178215-13, 459566-11, 463200-04, 439278-94, 459505-18, 399139-66, 455735-62, 327382-03, 439040-62, 233779-51, 431387-38, 438589-72, 437892-49, 458178-76]
3 [431380-63]
4 [442539-01, 434388-16, 454950-82, 463197-61, 228893-32, 464322-07, 462579-94, 438781-51, 437273-11, 265395-79, 463560-76, 462525-31, 439426-27, 438458-68, 464300-38, 442676-80]
5 [234729-10, 435926-98, 416670-04, 179514-28]
6 [0]
7 [0]
8 [267726-25, 235217-71, 227314-72, 185293-18, 434447-56, 170271-19, 454661-20]
9 [0]
Here is a solution using the scipy.spatial.distance.pdist
function to compute the pairwise distances (see full code at the end).
While scipy.spatial.distance
has a jaccard
method, this one is made for boolean arrays. We will need to define a custom function (using this definition of the jaccard distance: 1-intersection/union
):
def jaccard(u, v):
u,v = set(u[0]), set(v[0]) # pdist will pass 2D data [[a,b,c]], so we need to slice
return 1-len(u.intersection(v))/len(u.union(v))
Then we apply it on our dataframe column.
Warning: pdist
expects a multidimensional array as input (Series won't work), so we need to slice the column as DataFrame (df[['ids']]
). Also, passing directly the function as metric
would cause an error as the function is not vectorized (see comment on that point below), so we need to wrap it in a lambda.
pdist(df[['ids']], metric=lambda u,v: jaccard(u,v))
As mentioned above, it is also possible to pass a vectorized function instead. For this, we can use numpy.vectorize
. Note that the function is slightly different than previously. Here we do not slice the first element of the passed values as it is already 1D.
def jaccard(u, v):
u,v = set(u), set(v)
return 1-len(u.intersection(v))/len(u.union(v))
pdist(df[['ids']], metric=np.vectorize(jaccard))
NB. A quick test on the provided dataset showed that the vectorized approach is actually slower than the lambda.
Finally, we transform the output back to matrix using scipy.spatial.distance.squareform
and the pandas.DataFrame
constructor:
pd.DataFrame(squareform(pdist(df[['ids']], metric=lambda u,v: jaccard(u,v))))
Let's start from this input:
df = pd.DataFrame([[['58545-19', '462423-43', '277581-25']],
[['0']],
[['454950-82', '433701-46', '228790-63', '266250-52', '458759-98', '152986-78', '222217-39', '433515-16', '265589-83', '439403-23', '277892-38', '223497-19', '224072-83', '461887-57', '436147-12', '227479-78', '228893-32', '279415-18', '439426-27', '437742-46', '438156-73', '438458-68', '277898-05', '438675-76', '454658-95', '431222-77', '462579-94', '434939-86', '222211-09', '178215-13', '459566-11', '463200-04', '439278-94', '459505-18', '399139-66', '455735-62', '327382-03', '439040-62', '233779-51', '431387-38', '438589-72', '437892-49', '458178-76']],
[['431380-63']],
[['442539-01', '434388-16', '454950-82', '463197-61', '228893-32', '464322-07', '462579-94', '438781-51', '437273-11', '265395-79', '463560-76', '462525-31', '439426-27', '438458-68', '464300-38', '442676-80']],
[['234729-10', '435926-98', '416670-04', '179514-28']],
[['0']],
[['0']],
[['267726-25', '235217-71', '227314-72', '185293-18', '434447-56', '170271-19', '454661-20']],
[['0']],
], columns=['ids'])
from scipy.spatial.distance import pdist, squareform
def jaccard(u, v):
u,v = set(u[0]), set(v[0])
return 1-len(u.intersection(v))/len(u.union(v))
pd.DataFrame(squareform(pdist(df[['ids']], metric=lambda u,v: jaccard(u,v))))
output:
0 1 2 3 4 5 6 7 8 9
0 0.0 1.0 1.000000 1.0 1.000000 1.0 1.0 1.0 1.0 1.0
1 1.0 0.0 1.000000 1.0 1.000000 1.0 0.0 0.0 1.0 0.0
2 1.0 1.0 0.000000 1.0 0.907407 1.0 1.0 1.0 1.0 1.0
3 1.0 1.0 1.000000 0.0 1.000000 1.0 1.0 1.0 1.0 1.0
4 1.0 1.0 0.907407 1.0 0.000000 1.0 1.0 1.0 1.0 1.0
5 1.0 1.0 1.000000 1.0 1.000000 0.0 1.0 1.0 1.0 1.0
6 1.0 0.0 1.000000 1.0 1.000000 1.0 0.0 0.0 1.0 0.0
7 1.0 0.0 1.000000 1.0 1.000000 1.0 0.0 0.0 1.0 0.0
8 1.0 1.0 1.000000 1.0 1.000000 1.0 1.0 1.0 0.0 1.0
9 1.0 0.0 1.000000 1.0 1.000000 1.0 0.0 0.0 1.0 0.0
Here is a graphical representation of the distances for the provided dataset (white = further away):