python deep-learning nlp nltk sentence-similarity

Is there a way to check the similarity between two full sentences in Python?


I am making a project like this one: https://www.youtube.com/watch?v=dovB8uSUUXE&feature=youtu.be, but I am facing trouble because I need to check the similarity between sentences. For example, if the user said 'the person wear red T-shirt' instead of 'the boy wear red T-shirt', I want a method to check the similarity between these two sentences without having to compare them word by word. Is there a way to do this in Python?

I am trying to find a way to check the similarity between two sentences.


Solution

  • Most of the libraries below should be a good choice for semantic similarity comparison. You can skip direct word-by-word comparison by generating word or sentence vectors with pretrained models from these libraries.

    Sentence similarity with spaCy

    The required models must be downloaded first.

    To use en_core_web_md, run python -m spacy download en_core_web_md. To use en_core_web_lg, run python -m spacy download en_core_web_lg.

    The large model is around 830 MB at the time of writing and quite slow, so the medium one can be a good choice.

    https://spacy.io/usage/vectors-similarity/
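
    If the model might not be installed yet, it can also be downloaded from inside Python. A minimal sketch using spacy.cli.download (part of spaCy), falling back to a one-time download on the first run:

    import spacy

    MODEL = "en_core_web_md"  # medium model: a good speed/quality trade-off

    try:
        nlp = spacy.load(MODEL)
    except OSError:
        # spaCy raises OSError when the model package is missing:
        # download it once, then load it.
        spacy.cli.download(MODEL)
        nlp = spacy.load(MODEL)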

    Code:

    import spacy

    # Load a model that ships with word vectors; the medium model also works.
    nlp = spacy.load("en_core_web_lg")
    # nlp = spacy.load("en_core_web_md")

    doc1 = nlp(u'the person wear red T-shirt')
    doc2 = nlp(u'this person is walking')
    doc3 = nlp(u'the boy wear red T-shirt')

    # Doc.similarity returns the cosine similarity of the document vectors.
    print(doc1.similarity(doc2))
    print(doc1.similarity(doc3))
    print(doc2.similarity(doc3))
    

    Output:

    0.7003971105290047
    0.9671912343259517
    0.6121211244876517
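
    For these models, doc.vector is the average of the token vectors, and similarity() is the cosine similarity between those document vectors. A minimal sketch reproducing the score by hand with NumPy, assuming doc1 and doc3 from the snippet above:

    import numpy as np

    def cosine(u, v):
        # Cosine similarity: dot product divided by the product of the norms.
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    print(cosine(doc1.vector, doc3.vector))  # matches doc1.similarity(doc3)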
    

    Sentence similarity with Sentence Transformers

    https://github.com/UKPLab/sentence-transformers

    https://www.sbert.net/docs/usage/semantic_textual_similarity.html

    Install with pip install -U sentence-transformers. This library generates sentence embeddings directly.

    Code:

    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('distilbert-base-nli-mean-tokens')

    sentences = [
        'the person wear red T-shirt',
        'this person is walking',
        'the boy wear red T-shirt'
    ]
    # encode() returns one embedding vector (a NumPy array) per sentence.
    sentence_embeddings = model.encode(sentences)

    for sentence, embedding in zip(sentences, sentence_embeddings):
        print("Sentence:", sentence)
        print("Embedding:", embedding)
        print("")
    

    Output:

    Sentence: the person wear red T-shirt
    Embedding: [ 1.31643847e-01 -4.20616418e-01 ... 8.13076794e-01 -4.64620918e-01]
    
    Sentence: this person is walking
    Embedding: [-3.52878094e-01 -5.04286848e-02 ... -2.36091137e-01 -6.77282438e-02]
    
    Sentence: the boy wear red T-shirt
    Embedding: [-2.36365378e-01 -8.49713564e-01 ... 1.06414437e+00 -2.70157874e-01]
    

    Now the embedding vectors can be used to compute various similarity metrics.

    Code:

    from sentence_transformers import util

    # util.pytorch_cos_sim returns the cosine similarity as a 1x1 tensor.
    print(util.pytorch_cos_sim(sentence_embeddings[0], sentence_embeddings[1]))
    print(util.pytorch_cos_sim(sentence_embeddings[0], sentence_embeddings[2]))
    print(util.pytorch_cos_sim(sentence_embeddings[1], sentence_embeddings[2]))
    

    Output:

    tensor([[0.4644]])
    tensor([[0.9070]])
    tensor([[0.3276]])
    

    The same scores can be computed with SciPy and PyTorch. Note that scipy.spatial.distance.cosine returns the cosine distance, so the similarity is one minus the distance.

    Code:

    from scipy.spatial import distance
    print(1 - distance.cosine(sentence_embeddings[0], sentence_embeddings[1]))
    print(1 - distance.cosine(sentence_embeddings[0], sentence_embeddings[2]))
    print(1 - distance.cosine(sentence_embeddings[1], sentence_embeddings[2]))
    

    Output:

    0.4643629193305969
    0.9069876074790955
    0.3275738060474396
    

    Code:

    import torch

    # torch.nn.CosineSimilarity with dim=0 compares two 1-D vectors.
    cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)
    b = torch.from_numpy(sentence_embeddings)
    print(cos(b[0], b[1]))
    print(cos(b[0], b[2]))
    print(cos(b[1], b[2]))
    

    Output:

    tensor(0.4644)
    tensor(0.9070)
    tensor(0.3276)
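
    For the use case in the question, one simple recipe is to embed the expected sentence and the user's sentence, then accept the input when the cosine similarity clears a threshold. A minimal sketch; the 0.8 threshold is an assumption that should be tuned on real data:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer('distilbert-base-nli-mean-tokens')

    expected = 'the boy wear red T-shirt'
    user_input = 'the person wear red T-shirt'

    emb = model.encode([expected, user_input])
    score = util.pytorch_cos_sim(emb[0], emb[1]).item()

    THRESHOLD = 0.8  # assumed value; tune on real data
    print(score, score >= THRESHOLD)  # ~0.91 for this pair, so it is accepted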
    

    Sentence similarity with TFHub Universal Sentence Encoder

    https://tfhub.dev/google/universal-sentence-encoder/4

    https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb

    This model is very large, around 1 GB, and seems slower than the others. It also generates embeddings for sentences.

    Code:

    import tensorflow_hub as hub

    # Downloads the model on first use, then loads it from the cache.
    embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
    embeddings = embed([
        "the person wear red T-shirt",
        "this person is walking",
        "the boy wear red T-shirt"
    ])

    print(embeddings)
    

    Output:

    tf.Tensor(
    [[ 0.063188    0.07063895 -0.05998802 ... -0.01409875  0.01863449
       0.01505797]
     [-0.06786212  0.01993554  0.03236153 ...  0.05772103  0.01787272
       0.01740014]
     [ 0.05379306  0.07613157 -0.05256693 ... -0.01256405  0.0213196
      -0.00262441]], shape=(3, 512), dtype=float32)
    

    Code:

    from scipy.spatial import distance
    print(1 - distance.cosine(embeddings[0], embeddings[1]))
    print(1 - distance.cosine(embeddings[0], embeddings[2]))
    print(1 - distance.cosine(embeddings[1], embeddings[2]))
    

    Output:

    0.15320375561714172
    0.8592830896377563
    0.09080004692077637
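
    The Universal Sentence Encoder produces approximately normalized vectors, so a plain inner product also approximates the cosine similarity and yields the whole pairwise matrix in one call. A small sketch, assuming embeddings from the snippet above:

    import numpy as np

    # 3x3 matrix of pairwise similarities between the three sentences.
    similarity_matrix = np.inner(embeddings, embeddings)
    print(similarity_matrix)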
    

    Other Sentence Embedding Libraries

    https://github.com/facebookresearch/InferSent

    https://github.com/Tiiiger/bert_score
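
    BERTScore, for instance, compares candidate sentences against references directly. A minimal sketch based on the project README (install with pip install bert-score):

    from bert_score import score

    cands = ['the person wear red T-shirt']
    refs = ['the boy wear red T-shirt']

    # Returns precision, recall and F1 tensors, one entry per pair.
    P, R, F1 = score(cands, refs, lang='en')
    print(F1)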


    Resources

    How to compute the similarity between two text documents?

    https://en.wikipedia.org/wiki/Cosine_similarity#Angular_distance_and_similarity

    https://towardsdatascience.com/word-distance-between-word-embeddings-cc3e9cf1d632

    https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.cosine.html

    https://www.tensorflow.org/api_docs/python/tf/keras/losses/CosineSimilarity

    https://nlp.town/blog/sentence-similarity/