I am trying to compute similarity between two samples.
The python functions sklearn.metrics.pairwise.cosine_similarity
and
scipy.spatial.distance.cosine
return results that I am not satisfied with. For example:
In the following I would have expected 0.0, because the two arrays have no values in common.
tt1 = [1, 16, 4, 21]
tt2 = [5, 17, 3, 22]
from scipy import spatial
res = 1-spatial.distance.cosine(tt1, tt2)
print(res)
0.9893593529663931
I would have expected a similarity of 0.25 (25%), because only a single value, the first one (1), is the same in both arrays.
tt1 = [1, 16, 4, 21]
tt2 = [1, 17, 3, 22]
from scipy import spatial
res = 1-spatial.distance.cosine(tt1, tt2)
print(res)
0.9990578001169402
In the same way, here I would expect 0.5, because two values (1 and 16) are identical:
tt1 = [1, 16, 4, 21]
tt2 = [1, 16, 3, 22]
res = 0.9989359418266097
Here 0.75 was expected, because three values (1, 16 and 4) are identical:
tt1 = [1, 16, 4, 21]
tt2 = [1, 16, 4, 22]
res = 0.9997474232272052
Is there a way in Python to achieve those expected results?
I think you are misunderstanding what the function computes. By your description, what you want is the accuracy (the complement of the misclassification error). However, the function receives two vectors u, v and computes the cosine distance between them. In your first example:
tt1 = [1, 16, 4, 21]
tt2 = [5, 17, 3, 22]
then u=tt1 and v=tt2. The values in the two arrays are the coordinates of these samples in the vector space they live in (here a 4-dimensional space), not separate samples. Refer to the function documentation, and specifically to the examples at the bottom.
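To make this concrete, here is a minimal sketch (using the arrays from the question) that computes the same value by hand, as the dot product of the two vectors divided by the product of their norms:

```python
import numpy as np

tt1 = np.array([1, 16, 4, 21])
tt2 = np.array([5, 17, 3, 22])

# Cosine similarity treats each array as a single 4-dimensional vector,
# and measures the angle between them - not the element-wise agreement.
cos_sim = tt1 @ tt2 / (np.linalg.norm(tt1) * np.linalg.norm(tt2))
print(cos_sim)  # ~0.9894, same as 1 - spatial.distance.cosine(tt1, tt2)
```

Because the two vectors point in almost the same direction, the similarity is close to 1 even though almost none of the individual values match.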
If each coordinate in these arrays represents a different sample, then:
If order matters (consider working with numpy arrays to begin with):
np.mean(np.array(tt1) == np.array(tt2))
If order does not matter:
len(np.intersect1d(np.array(tt1), np.array(tt2))) / len(tt1)
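Putting both one-liners together on the arrays from the question, a quick sketch (note `intersect1d` works on unique values, so this assumes the arrays have no duplicates):

```python
import numpy as np

tt1 = np.array([1, 16, 4, 21])
tt2 = np.array([1, 17, 3, 22])

# Order matters: fraction of positions where the values agree.
print(np.mean(tt1 == tt2))  # 0.25

# Order does not matter: fraction of tt1's values also present in tt2.
print(len(np.intersect1d(tt1, tt2)) / len(tt1))  # 0.25

# Three matching positions out of four:
tt3 = np.array([1, 16, 4, 22])
print(np.mean(tt1 == tt3))  # 0.75
```

This reproduces exactly the 0.25 and 0.75 values you expected in your examples.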