python numpy pytorch sentence-transformers

Python Sentence Transformer - Get matching Sentence by index order


I have a database table with a lot of records, and I am comparing sentences against it to find the best match.

Let's say the table contains 4 columns: id, sentence, info, updated_date. The data looks like this:

id  sentence                                  info                updated_date
1   What is the name of your company          some distinct info  19/12/2022
2   Company Name                              some distinct info  18/12/2022
3   What is the name of your company          some distinct info  17/12/2022
4   What is the name of your company          some distinct info  16/12/2022
5   What is the name of your company          some distinct info  15/12/2022
6   What is the name of your company          some distinct info  14/12/2022
7   What is the name of your company          some distinct info  13/12/2022
8   What is the phone number of your company  some distinct info  12/12/2022
9   What is the name of your company          some distinct info  11/12/2022
10  What is the name of your company          some distinct info  10/12/2022

I have converted these sentences to tensors.

As an example, I am passing "What is the name of your company" (as a tensor) to match against them.

sentence = "What is the name of your company" # in tensor format
cos_scores = util.pytorch_cos_sim(sentence, all_sentences_tensors)[0]

top_results = torch.topk(cos_scores, k=5)
# or (negated, because argpartition selects the smallest values)
top_results = np.argpartition(-cos_scores, range(5))[0:5]

top_results does not return the top results in index order.
Because the sentences are identical, they all have a score of 1, and the tied results come back in arbitrary order.

What I want is to get the top 5 matches ordered by the latest updated_date or by index.

Is this possible to achieve?

Any suggestions?


Solution

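  • What I would do is as follows:

    1. Get the cosine similarity score of the query against each sentence and store the scores in an array.
    2. Sort the row indices by score (highest first), using updated_date (latest first) to break ties.
    3. Get the top 5 indices from the sorted array.
    4. Get the corresponding sentences from the database table using the indices.

    This should give you the top 5 matches, with ties ordered by the latest updated_date. Something like this should work (CountVectorizer stands in for the sentence embeddings here, and the sample rows are the ones from the question):
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    
    # Rows from the table in id order (the sample data from the question)
    name_q = "What is the name of your company"
    phone_q = "What is the phone number of your company"
    sentences = [name_q, "Company Name"] + [name_q] * 5 + [phone_q] + [name_q] * 2
    updated_dates = np.array([f"2022-12-{d:02d}" for d in range(19, 9, -1)], dtype="datetime64[D]")
    query = "What is the name of your company"
    
    # Transform the stored sentences and the query into count vectors
    vectorizer = CountVectorizer()
    vectors = vectorizer.fit_transform(sentences)
    query_vector = vectorizer.transform([query])
    
    # Score the query against every stored sentence (one score per row)
    cosine_scores = cosine_similarity(query_vector, vectors)[0]
    
    # Round so that floating-point noise does not split ties between identical sentences
    rounded_scores = np.round(cosine_scores, 6)
    
    # Sort by score (descending), breaking ties by updated_date (descending).
    # np.lexsort treats the last key as the primary key.
    sorted_indices = np.lexsort((-updated_dates.astype("int64"), -rounded_scores))
    
    # Get the top 5 indices from the sorted array
    top_5_indices = sorted_indices[:5]
    
    # Get the corresponding sentences from the database table using the indices
    top_5_sentences = [sentences[i] for i in top_5_indices]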
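
If the scores already come from sentence embeddings, as in the question, the tie-breaking can be applied directly to the score array. The following is only a minimal sketch under some assumptions: the scores have been converted to a NumPy array (for a CPU tensor, something like cos_scores.numpy()), the rows are in id order, and the helper names top_k_stable and top_k_latest as well as the score values are made up for illustration.

```python
import numpy as np

def top_k_stable(scores, k=5, decimals=6):
    """Return indices of the k highest scores; ties keep ascending index order."""
    # Round first so that float noise (0.9999999 vs 1.0) does not split ties.
    rounded = np.round(np.asarray(scores, dtype=float), decimals)
    # A stable sort on the negated scores keeps equal scores in index order.
    return np.argsort(-rounded, kind="stable")[:k]

def top_k_latest(scores, dates, k=5, decimals=6):
    """Return indices of the k highest scores; ties go to the most recent date."""
    rounded = np.round(np.asarray(scores, dtype=float), decimals)
    # Convert ISO date strings to integer day counts so they can be negated.
    days = np.asarray(dates, dtype="datetime64[D]").astype("int64")
    # np.lexsort uses the LAST key as the primary sort key.
    return np.lexsort((-days, -rounded))[:k]

# Hypothetical scores for the ten rows in the question (rows 2 and 8 differ).
scores = [1.0, 0.62, 1.0, 0.9999999, 1.0, 1.0, 1.0, 0.71, 1.0, 1.0]
# updated_date values from the table, rewritten in ISO format.
dates = [f"2022-12-{day:02d}" for day in range(19, 9, -1)]

print(top_k_stable(scores))         # tie-break by index order
print(top_k_latest(scores, dates))  # tie-break by latest updated_date
```

The rounding step matters because embedding-based cosine scores for identical sentences are not always exactly 1.0; without it, tiny differences would decide the order instead of the date or index.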