
Finding the most similar sentences among all rows in Python


Suggestions, reference links, and code are appreciated.

I have a dataset with more than 1500 rows. Each row contains a sentence. I am trying to find the best method for identifying the most similar sentences among all of them.

What I have tried

  1. I tried the K-means algorithm, which groups similar sentences into clusters. The drawback is that I have to pass K, the number of clusters, and K is hard to guess. I tried the elbow method to estimate it, but assigning every row to a cluster isn't what I need: with this approach all of the data ends up grouped. I only want the rows whose similarity is above 0.90 returned, together with their IDs.

  2. I tried cosine similarity: I used TfidfVectorizer to create the matrix and then passed it to cosine similarity. This approach didn't work properly either.

What I am looking for

I want an approach where I can pass a threshold, for example 0.90, and get back all rows that are similar to each other above that threshold.

Data Sample
ID    |   DESCRIPTION
-----------------------------
10    | Cancel ASN WMS Cancel ASN   
11    | MAXPREDO Validation is corect
12    | Move to QC  
13    | Cancel ASN WMS Cancel ASN   
14    | MAXPREDO Validation is right
15    | Verify files are sent every hours for this interface from Optima
16    | MAXPREDO Validation are correct
17    | Move to QC  
18    | Verify files are not sent

Expected result

Rows from the data above that are similar to each other at the 0.90 level should be returned with their IDs:

ID    |   DESCRIPTION
-----------------------------
10    | Cancel ASN WMS Cancel ASN
13    | Cancel ASN WMS Cancel ASN
11    | MAXPREDO Validation is corect  # even spelling is not correct
14    | MAXPREDO Validation is right
16    | MAXPREDO Validation are correct
12    | Move to QC  
17    | Move to QC  

Solution

  • Why did cosine similarity with the TfidfVectorizer not work for you?

    I tried it and it works with this code:

    import pandas as pd

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Build the sample data from a plain list of rows; np.matrix is deprecated
    # and would silently convert the integer IDs to strings.
    df = pd.DataFrame(columns=["ID","DESCRIPTION"], data=[[10,"Cancel ASN WMS Cancel ASN"],
                                                          [11,"MAXPREDO Validation is corect"],
                                                          [12,"Move to QC"],
                                                          [13,"Cancel ASN WMS Cancel ASN"],
                                                          [14,"MAXPREDO Validation is right"],
                                                          [15,"Verify files are sent every hours for this interface from Optima"],
                                                          [16,"MAXPREDO Validation are correct"],
                                                          [17,"Move to QC"],
                                                          [18,"Verify files are not sent"]
                                                          ])
    
    corpus = list(df["DESCRIPTION"].values)
    
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)
    
    threshold = 0.4
    
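    # Compare every pair of rows exactly once: y starts at x + 1,
    # so a row is never compared with itself.
    for x in range(X.shape[0]):
        for y in range(x + 1, X.shape[0]):
            similarity = cosine_similarity(X[x], X[y])  # 1x1 array
            if similarity[0, 0] > threshold:
                print(df["ID"][x], ":", corpus[x])
                print(df["ID"][y], ":", corpus[y])
                print("Cosine similarity:", similarity)
                print()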
    

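    The threshold can be adjusted, but a threshold of 0.9 will not yield the results you want: only the exact duplicates (IDs 10/13 and 12/17) score above 0.9, while the closest near-duplicate pair (11/14) reaches only about 0.64.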

    The output for a threshold of 0.4 is:

    10 : Cancel ASN WMS Cancel ASN
    13 : Cancel ASN WMS Cancel ASN
    Cosine similarity: [[1.]]
    
    11 : MAXPREDO Validation is corect
    14 : MAXPREDO Validation is right
    Cosine similarity: [[0.64183024]]
    
    12 : Move to QC
    17 : Move to QC
    Cosine similarity: [[1.]]
    
    15 : Verify files are sent every hours for this interface from Optima
    18 : Verify files are not sent
    Cosine similarity: [[0.44897995]]
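
The loop above calls cosine_similarity once per pair of rows. For the 1500+ rows mentioned in the question, one possible alternative is to compute the whole similarity matrix in a single call and filter it with NumPy. This is only a sketch under that assumption; the helper name similar_pairs is made up for illustration and is not part of scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame(
    [
        [10, "Cancel ASN WMS Cancel ASN"],
        [11, "MAXPREDO Validation is corect"],
        [12, "Move to QC"],
        [13, "Cancel ASN WMS Cancel ASN"],
        [14, "MAXPREDO Validation is right"],
        [15, "Verify files are sent every hours for this interface from Optima"],
        [16, "MAXPREDO Validation are correct"],
        [17, "Move to QC"],
        [18, "Verify files are not sent"],
    ],
    columns=["ID", "DESCRIPTION"],
)


def similar_pairs(frame, threshold):
    """Hypothetical helper: return (id_a, id_b, similarity) for each pair above the threshold."""
    X = TfidfVectorizer().fit_transform(frame["DESCRIPTION"])
    sim = cosine_similarity(X)  # dense (n_rows, n_rows) matrix
    # Upper triangle only (k=1 skips the diagonal), so each pair is reported once
    rows, cols = np.triu_indices_from(sim, k=1)
    keep = sim[rows, cols] > threshold
    ids = frame["ID"].to_numpy()
    return [
        (int(ids[r]), int(ids[c]), float(sim[r, c]))
        for r, c in zip(rows[keep], cols[keep])
    ]


pairs = similar_pairs(df, threshold=0.4)
for id_a, id_b, score in pairs:
    print(id_a, id_b, round(score, 4))
```

With the sample data and a threshold of 0.4 this should report the same four pairs as the loop. The dense matrix holds n_rows × n_rows values, which is unproblematic for a few thousand rows but would need a different approach for much larger data.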
    

    With a threshold of 0.39, all of your expected sentences are featured in the output, but the pair with IDs [15, 18], which is not part of your expected result, is returned as well:

    10 : Cancel ASN WMS Cancel ASN
    13 : Cancel ASN WMS Cancel ASN
    Cosine similarity: [[1.]]
    
    11 : MAXPREDO Validation is corect
    14 : MAXPREDO Validation is right
    Cosine similarity: [[0.64183024]]
    
    11 : MAXPREDO Validation is corect
    16 : MAXPREDO Validation are correct
    Cosine similarity: [[0.39895808]]
    
    12 : Move to QC
    17 : Move to QC
    Cosine similarity: [[1.]]
    
    14 : MAXPREDO Validation is right
    16 : MAXPREDO Validation are correct
    Cosine similarity: [[0.39895808]]
    
    15 : Verify files are sent every hours for this interface from Optima
    18 : Verify files are not sent
    Cosine similarity: [[0.44897995]]
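
The question asks for rows grouped with their IDs rather than a list of pairs. One way to get there from the pairwise similarities, sketched here as an assumption rather than as part of the approach above, is to treat every pair above the threshold as an edge in a graph and read off the connected components with SciPy. The helper name group_similar and the GROUP column are hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame(
    [
        [10, "Cancel ASN WMS Cancel ASN"],
        [11, "MAXPREDO Validation is corect"],
        [12, "Move to QC"],
        [13, "Cancel ASN WMS Cancel ASN"],
        [14, "MAXPREDO Validation is right"],
        [15, "Verify files are sent every hours for this interface from Optima"],
        [16, "MAXPREDO Validation are correct"],
        [17, "Move to QC"],
        [18, "Verify files are not sent"],
    ],
    columns=["ID", "DESCRIPTION"],
)


def group_similar(frame, threshold):
    """Hypothetical helper: label rows that are linked by similarity above the threshold."""
    X = TfidfVectorizer().fit_transform(frame["DESCRIPTION"])
    sim = cosine_similarity(X)
    # Adjacency matrix: 1 where two rows are more similar than the threshold
    adjacency = csr_matrix((sim > threshold).astype(int))
    _, labels = connected_components(adjacency, directed=False)
    result = frame.assign(GROUP=labels)
    # Drop rows that ended up alone in their group
    sizes = result.groupby("GROUP")["ID"].transform("size")
    return result[sizes > 1].sort_values(["GROUP", "ID"])


grouped = group_similar(df, threshold=0.39)
print(grouped.to_string(index=False))
```

Connected components are transitive: if A is linked to B and B to C, all three land in one group even when A and C are below the threshold themselves, so a low threshold can chain loosely related sentences together. With the sample data and a threshold of 0.39, the pair [15, 18] still forms its own group, as in the output above.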