To keep it simple, say I have four vectors -- W, X, Y, Z -- each containing the same number of values. I'm trying to calculate cosine similarity across them pairwise in Python, but I can't seem to get the right answer.
If I try comparing W vs. X:
print(np.dot(W, X.T)/(np.linalg.norm(W)*np.linalg.norm(X)))
I get the following result:
[[0.9984622004973391]]
If I compare W vs. Y I get:
[[0.8891911653057049]]
And if I compare W to Z I get:
[[0.9676746591879851]]
Of course, I don't want to do these manually one by one, as in reality I have many vectors.
When I try to calculate all three (X, Y, Z) vs. W at once:
V = pd.concat([X, Y, Z])
print(np.dot(W, V.T)/(np.linalg.norm(W)*np.linalg.norm(V)))
I get the following:
[[0.9982175434442747 0.005561082504669956 0.020547860729214433]]
...where the first value nearly matches what I got when running them individually (but still not quite), while the others are way off.
I must have an issue with my approach to the all at once version, but I have not been able to figure out how to fix it. Any ideas? Thanks!
When you execute np.dot(W, V.T), you get three values like
[[3.9353 2.4442 2.418 ]]
Each value needs its own normalization (one for each of X, Y, Z), but when you call np.linalg.norm(V) you get just one value (the norm of the whole matrix V). To calculate the norm of each vector (one per row), you must add the parameter axis=1.
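To see the difference between the two calls, here is a minimal sketch. The matrix below is a made-up stand-in for V (three short row vectors), not the data from the question:

```python
import numpy as np

# Hypothetical stand-in for V: three row vectors stacked into a 3x2 matrix
V = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [0.0, 5.0]])

# Without axis: one scalar for the entire matrix
whole_matrix_norm = np.linalg.norm(V)

# With axis=1: one norm per row
row_norms = np.linalg.norm(V, axis=1)

print(whole_matrix_norm)
print(row_norms)
```

The single matrix norm is larger than any individual row norm, which is why dividing every dot product by it gives values that are too small.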
Finally, the correct and short code looks like this:
import numpy as np
# Stack X, Y, Z as rows, then divide each dot product by that row's own norm
V = np.concatenate([X, Y, Z])
cos_sim = (W @ V.T)/(np.linalg.norm(W)*np.linalg.norm(V, axis=1))
print(cos_sim)
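As a sanity check, the batched result can be compared against the one-by-one calculation from the question. This is only a sketch: the vectors are invented 1-row NumPy arrays (the question appears to use pandas objects), and cos_one is a hypothetical helper name:

```python
import numpy as np

# Hypothetical 1-row vectors standing in for W, X, Y, Z
W = np.array([[1.0, 2.0, 3.0]])
X = np.array([[1.0, 2.0, 3.5]])
Y = np.array([[3.0, 0.0, 1.0]])
Z = np.array([[0.5, 2.5, 2.0]])

def cos_one(a, b):
    # Pairwise version, same formula as in the question
    return (np.dot(a, b.T) / (np.linalg.norm(a) * np.linalg.norm(b))).item()

# Batched version: one vector per row, one norm per row
V = np.concatenate([X, Y, Z])
cos_sim = (W @ V.T) / (np.linalg.norm(W) * np.linalg.norm(V, axis=1))

one_by_one = [cos_one(W, v) for v in (X, Y, Z)]
print(cos_sim)
print(one_by_one)
```

Both printouts should show the same three values, since each dot product is now divided by the norm of its own row.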