I have a dataset stored as a sparse weighted matrix, and I want to calculate the weighted Jaccard index between its rows for downstream grouping/clustering, inspired by this article: http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36928.pdf
I'm struggling to find an efficient way to do this calculation in Python. For two vectors x and y the weighted Jaccard index is sum_i min(x_i, y_i) / sum_i max(x_i, y_i), and my current function to test my hypothesis is the following:
def weighted_jaccard_index(x, y):
    mins, maxs = 0, 0
    for i in np.arange(len(x)):
        mins += np.amin([x[i], y[i]])
        maxs += np.amax([x[i], y[i]])
    return mins / maxs
I then feed this function to pdist from scipy.spatial.distance as a custom metric:

from scipy.spatial.distance import pdist

w_j = pdist(X, weighted_jaccard_index)

Not very surprisingly, I am seeing big performance issues: with 20,000 observations, pdist has to call the Python-level metric N(N-1)/2 ≈ 2×10^8 times.
I am currently investigating using MinHash with the datasketch package, but I am happy to take input on how this would best be achieved. I think something like this would be possible:
import datasketch as ds

df_np = df_gameinfo.to_numpy()

# dim must match the number of columns (30 here); num_perm of the LSH
# index has to equal the generator's sample_size
mg = ds.WeightedMinHashGenerator(df_np.shape[1], 20000)
lsh = ds.MinHashLSH(threshold=0.2, num_perm=20000)

# Create index in LSH
for i in range(len(df_np)):
    vc = df_np[i]
    m_hash = mg.minhash(vc)
    lsh.insert(i, m_hash)

test_ex = df_gameinfo.iloc[9522].to_numpy()  # random observation to query for neighbours
test_ex_m = mg.minhash(test_ex)
lsh.query(test_ex_m)
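To go from single queries to the actual grouping, I imagine something along these lines (an untested sketch; the greedy "first unassigned row starts a group" rule is just a placeholder of mine, not anything from datasketch):

# Sketch: form candidate groups from LSH queries.
# Caching the WeightedMinHash objects built during indexing would avoid
# recomputing them here.
assigned = set()
groups = []
for i in range(len(df_np)):
    if i in assigned:
        continue
    # keys whose estimated weighted Jaccard with row i is >= the LSH threshold
    candidates = lsh.query(mg.minhash(df_np[i]))
    group = [j for j in candidates if j not in assigned]
    assigned.update(group)
    groups.append(group)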
Potentially this could be vectorised in NumPy, but I am a bit unsure how to phrase it.
The data is of size 20,000 observations by 30 dimensions (N×M = 20,000×30).
You can use concatenate: stack the two vectors as columns, then take the element-wise min/max along axis 1 (note that pdist passes 1-D rows, so the vectors have to be reshaped to columns first):

q = np.concatenate([x[:, None], y[:, None]], axis=1)
np.sum(np.amin(q, axis=1)) / np.sum(np.amax(q, axis=1))
With %%timeit -r 10 -n 10 this gives 131 µs ± 61.7 µs per loop (mean ± std. dev. of 10 runs, 10 loops each) on 100×10 arrays, while your original function takes 4.46 ms ± 95.9 µs per loop (mean ± std. dev. of 10 runs, 10 loops each) on the same data.
def jacc_index(x, y):
    # stack x and y as two columns, then reduce across them element-wise
    q = np.concatenate([x[:, None], y[:, None]], axis=1)
    return np.sum(np.amin(q, axis=1)) / np.sum(np.amax(q, axis=1))
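Passing jacc_index to pdist works exactly like before (w_j = pdist(X, jacc_index)), but that still makes N(N-1)/2 ≈ 2×10^8 Python-level calls for your 20,000×30 matrix. If you want to stay fully in NumPy, a blockwise broadcast over rows avoids the Python-level metric entirely; a rough sketch, assuming X is a dense float array and that you can afford the full output matrix (the block size is just a knob to trade memory for speed):

import numpy as np

def pairwise_weighted_jaccard(X, block=64):
    """All-pairs weighted Jaccard similarity, computed block by block."""
    X = np.asarray(X, dtype=np.float64)
    n, m = X.shape
    S = np.empty((n, n), dtype=np.float32)       # ~1.6 GB for n = 20,000
    for start in range(0, n, block):
        stop = min(start + block, n)
        xb = X[start:stop, None, :]              # (b, 1, m)
        mins = np.minimum(xb, X[None, :, :])     # (b, n, m) temporaries:
        maxs = np.maximum(xb, X[None, :, :])     # ~block * n * m * 8 bytes each
        S[start:stop] = mins.sum(axis=2) / maxs.sum(axis=2)
    return S

For clustering you would then use 1 - S as the distance. The memory footprint is the main drawback of this approach, which is also why the MinHash/LSH route can still make sense at this scale.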