python similarity embedding cosine-similarity sentence-similarity

Cosine similarity between sentences raises ValueError: Expected 2D array, got 1D array instead


I'm calculating cosine similarity over a list of sentences. The embeddings have already been computed.

Here are the embeddings:

The shape of embedding (11, 3072)
[[-0.02179624 -0.17235152 -0.14017016 ...  0.33180898  0.13701975
  -0.2275123 ]
 [ 0.08176168  0.03396776 -0.00361721 ... -0.06099782 -0.1941497
   0.16414282]
 [ 0.01786027 -0.07074962  0.08268858 ... -0.15433213  0.22098969
  -0.05902294]
 ...
 [-0.33807683  0.06110802  0.32764304 ...  0.07062552 -0.2734855
  -0.01919978]
 [-0.09536518  0.04956777  0.64503926 ... -0.11085486 -0.36796266
   0.2826454 ]
 [-0.12355942 -0.1552269  -0.01554828 ... -0.14761439  0.17142747
  -0.02176587]]

and here's the list of sentences:

document1 = ["sentence a", "sentence b", "sentence c", ...] # There are 11 sentences

I tried to calculate the similarity between each pair of sentences using cosine similarity:

# Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity
sentences_2d = np.array(document1).reshape(-1,1)
similarity_matrix = np.zeros([len(document1), len(document1)])
for i in range(len(sentences_2d)):
  for j in range(len(sentences_2d)):
    if i != j:
      similarity_matrix[i][j] = cosine_similarity(arrcatembed[i], arrcatembed[j])

When I run the similarity calculation, I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-46-e15cce98d633> in <module>
      6   for j in range(len(sentences_2d)):
      7     if i != j:
----> 8       similarity_matrix[i][j] = cosine_similarity(arrcatembed[i], arrcatembed[j])

2 frames
/usr/local/lib/python3.9/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    900             # If input is 1D raise error
    901             if array.ndim == 1:
--> 902                 raise ValueError(
    903                     "Expected 2D array, got 1D array instead:\narray={}.\n"
    904                     "Reshape your data either using array.reshape(-1, 1) if "

ValueError: Expected 2D array, got 1D array instead:
array=[-0.02179624 -0.17235152 -0.14017016 ...  0.33180898  0.13701975
 -0.2275123 ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Can anyone help me solve this problem? Thank you.

I want the similarity between every pair of sentences in the list: the first sentence against the second through eleventh, the second against the first and the third through eleventh, and so on. This is what I have already done with cosine distance:

The shape (11, 11)
The length 11
[[1.         0.90366799 0.92140669 0.90678644 0.88496917 0.89278495
  0.93188739 0.87325549 0.88947386 0.86656564 0.90396279]
 [0.90366799 1.         0.91544878 0.95543408 0.93818021 0.94250894
  0.93432641 0.93418741 0.92931563 0.9156481  0.91719031]
 [0.92140669 0.91544878 1.         0.92346388 0.91356987 0.93290257
  0.94972414 0.90773791 0.92120057 0.90897304 0.92319667]
 [0.90678644 0.95543408 0.92346388 1.         0.94258463 0.95669407
  0.94972783 0.93550926 0.93902498 0.93075407 0.92586052]
 [0.88496917 0.93818021 0.91356987 0.94258463 1.         0.95144665
  0.92863572 0.95595235 0.9522922  0.94791383 0.94201249]
 [0.89278495 0.94250894 0.93290257 0.95669407 0.95144665 1.
  0.95301741 0.95989478 0.95237011 0.94007719 0.93626297]
 [0.93188739 0.93432641 0.94972414 0.94972783 0.92863572 0.95301741
  1.         0.92727625 0.93515086 0.92043686 0.92175251]
 [0.87325549 0.93418741 0.90773791 0.93550926 0.95595235 0.95989478
  0.92727625 1.         0.96572489 0.95371407 0.92973185]
 [0.88947386 0.92931563 0.92120057 0.93902498 0.9522922  0.95237011
  0.93515086 0.96572489 1.         0.95132333 0.9478088 ]
 [0.86656564 0.9156481  0.90897304 0.93075407 0.94791383 0.94007719
  0.92043686 0.95371407 0.95132333 1.         0.92758161]
 [0.90396279 0.91719031 0.92319667 0.92586052 0.94201249 0.93626297
  0.92175251 0.92973185 0.9478088  0.92758161 1.        ]]

Solution

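  • cosine_similarity expects 2D input of shape (n_samples, n_features) and returns a 2D array of shape (n_samples, n_samples), so you don't need the nested loop: a single call already compares every row with every other row. The error occurs because indexing a single row, as in arrcatembed[i], gives a 1D array of shape (3072,), which the function rejects.

    Your code should look like this, where embeddings is your (11, 3072) array (arrcatembed in your code):

    similarity_matrix = cosine_similarity(embeddings)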
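To make the shapes concrete, here is a minimal sketch of both options: one call on the whole array, and a single-pair comparison with each row reshaped to (1, n_features), as the error message suggests. The random array is only a stand-in for the real sentence embeddings, and the names embeddings, pair, and no_self are illustrative rather than taken from the question.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in data (assumption): 11 random vectors in place of real sentence embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(11, 3072))

# One call compares every row with every other row; the result has shape (11, 11).
similarity_matrix = cosine_similarity(embeddings)

# Comparing a single pair: embeddings[i] alone is 1D with shape (3072,),
# so each row is reshaped to (1, n_features) before the call.
pair = cosine_similarity(embeddings[0].reshape(1, -1),
                         embeddings[1].reshape(1, -1))

# The original loop skipped i == j, which left zeros on the diagonal.
# If that behaviour is wanted, zero the diagonal of a copy.
no_self = similarity_matrix.copy()
np.fill_diagonal(no_self, 0.0)

print(similarity_matrix.shape)  # (11, 11)
print(pair.shape)               # (1, 1)
```

The single-pair form is mainly useful for spot checks. For the full matrix, the one-call version avoids 110 separate function calls and gives the same values.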