I'm computing cosine similarity between the sentences in a list. I've already computed the embeddings.
Here are the embeddings:
The shape of embedding (11, 3072)
[[-0.02179624 -0.17235152 -0.14017016 ... 0.33180898 0.13701975
-0.2275123 ]
[ 0.08176168 0.03396776 -0.00361721 ... -0.06099782 -0.1941497
0.16414282]
[ 0.01786027 -0.07074962 0.08268858 ... -0.15433213 0.22098969
-0.05902294]
...
[-0.33807683 0.06110802 0.32764304 ... 0.07062552 -0.2734855
-0.01919978]
[-0.09536518 0.04956777 0.64503926 ... -0.11085486 -0.36796266
0.2826454 ]
[-0.12355942 -0.1552269 -0.01554828 ... -0.14761439 0.17142747
-0.02176587]]
And here's the list of sentences:
document1 = ["sentence a", "sentence b", "sentence c", ...] # There are 11 sentences
I tried to calculate the similarity between every pair of sentences using cosine similarity:
# Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity
sentences_2d = np.array(document1).reshape(-1,1)
similarity_matrix = np.zeros([len(document1), len(document1)])
for i in range(len(sentences_2d)):
    for j in range(len(sentences_2d)):
        if i != j:
            similarity_matrix[i][j] = cosine_similarity(arrcatembed[i], arrcatembed[j])
When I run the similarity calculation, I get this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-46-e15cce98d633> in <module>
6 for j in range(len(sentences_2d)):
7 if i != j:
----> 8 similarity_matrix[i][j] = cosine_similarity(arrcatembed[i], arrcatembed[j])
2 frames
/usr/local/lib/python3.9/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
900 # If input is 1D raise error
901 if array.ndim == 1:
--> 902 raise ValueError(
903 "Expected 2D array, got 1D array instead:\narray={}.\n"
904 "Reshape your data either using array.reshape(-1, 1) if "
ValueError: Expected 2D array, got 1D array instead:
array=[-0.02179624 -0.17235152 -0.14017016 ... 0.33180898 0.13701975
-0.2275123 ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Can anyone help me solve this problem? Thank you.
I want the similarity between every pair of sentences in the list: the first sentence against the second through eleventh, the second against the first and the third through eleventh, and so on. This is the kind of result I got with cosine distance:
The shape (11, 11)
The length 11
[[1. 0.90366799 0.92140669 0.90678644 0.88496917 0.89278495
0.93188739 0.87325549 0.88947386 0.86656564 0.90396279]
[0.90366799 1. 0.91544878 0.95543408 0.93818021 0.94250894
0.93432641 0.93418741 0.92931563 0.9156481 0.91719031]
[0.92140669 0.91544878 1. 0.92346388 0.91356987 0.93290257
0.94972414 0.90773791 0.92120057 0.90897304 0.92319667]
[0.90678644 0.95543408 0.92346388 1. 0.94258463 0.95669407
0.94972783 0.93550926 0.93902498 0.93075407 0.92586052]
[0.88496917 0.93818021 0.91356987 0.94258463 1. 0.95144665
0.92863572 0.95595235 0.9522922 0.94791383 0.94201249]
[0.89278495 0.94250894 0.93290257 0.95669407 0.95144665 1.
0.95301741 0.95989478 0.95237011 0.94007719 0.93626297]
[0.93188739 0.93432641 0.94972414 0.94972783 0.92863572 0.95301741
1. 0.92727625 0.93515086 0.92043686 0.92175251]
[0.87325549 0.93418741 0.90773791 0.93550926 0.95595235 0.95989478
0.92727625 1. 0.96572489 0.95371407 0.92973185]
[0.88947386 0.92931563 0.92120057 0.93902498 0.9522922 0.95237011
0.93515086 0.96572489 1. 0.95132333 0.9478088 ]
[0.86656564 0.9156481 0.90897304 0.93075407 0.94791383 0.94007719
0.92043686 0.95371407 0.95132333 1. 0.92758161]
[0.90396279 0.91719031 0.92319667 0.92586052 0.94201249 0.93626297
0.92175251 0.92973185 0.9478088 0.92758161 1. ]]
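cosine_similarity expects input of shape (n_samples, n_features) and returns a 2D array of shape (n_samples, n_samples). The error occurs because a single row such as arrcatembed[i] is a 1D array. You don't need the nested loop at all: one call on the whole array already compares every row with every other row.
Your code should look like this, where embeddings is your (11, 3072) array (arrcatembed in the question):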
similarity_matrix = cosine_similarity(embeddings)
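As a rough, self-contained sketch of the above: the random `embeddings` array below is only a stand-in for the real (11, 3072) embeddings, and the names `pair`, `single_value`, and `no_self` are made up for illustration. It shows the single call, the reshape you would need if you really wanted one pair at a time, and how to get zeros on the diagonal like the original loop would have produced.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in data (assumption): 11 sentences, 3072 features each, random values.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(11, 3072))

# One call computes every pairwise similarity; no nested loop needed.
similarity_matrix = cosine_similarity(embeddings)
print(similarity_matrix.shape)  # (11, 11)

# For a single pair, each 1D row must first be reshaped to (1, n_features).
pair = cosine_similarity(embeddings[0].reshape(1, -1),
                         embeddings[1].reshape(1, -1))
single_value = pair[0, 0]  # same value as similarity_matrix[0, 1]

# The loop in the question skipped i == j, leaving zeros on the diagonal.
# To reproduce that, zero the diagonal of a copy explicitly.
no_self = similarity_matrix.copy()
np.fill_diagonal(no_self, 0.0)
```

The full matrix is symmetric with ones on the diagonal, so the single call is usually all you need; the reshape variant is only worth it when you genuinely want one pair.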