I have two strings that differ only slightly:
str1 = 'abcdefgh'
str2 = 'abcdef-gh'
The only difference is that the second string contains a hyphen, yet tf-idf gives a similarity of 0.
The code to compute the tf-idf similarity is below:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def compute_cosine_similarity(str1, str2):
    # Create a TF-IDF vectorizer
    vectorizer = TfidfVectorizer()
    # Compute the TF-IDF matrix for the two strings
    tfidf_matrix = vectorizer.fit_transform([str1, str2])
    # Compute the cosine similarity between the two TF-IDF vectors
    similarity_matrix = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])
    # Extract the similarity score from the matrix
    similarity_score = similarity_matrix[0][0]
    return similarity_score
similarity_score = compute_cosine_similarity(str1, str2)
But if I change to:
str1 = 'abcdef-gh'
str2 = 'abcdef-gh'
The similarity is 1. It seems that tf-idf doesn't handle special symbols such as '-' well when they appear in only one of the strings.
Why is that?
If you examine the vocabulary of your fitted vectorizer instance, the scoring makes sense.
print(vectorizer.vocabulary_)
-> {'abcdefgh': 1, 'abcdef': 0, 'gh': 2}
And, in addition, the document-term matrix:
print(tfidf_matrix.toarray())
-> array([[0.        , 1.        , 0.        ],
          [0.70710678, 0.        , 0.70710678]])
The default word analyzer tokenizes on non-word characters, so the hyphen splits str2 into two tokens. Document str1 therefore consists of the single word 'abcdefgh', while document str2 consists of the two different words 'abcdef' and 'gh'. Because the two document vectors share no terms, their cosine similarity is 0.
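You can see this tokenization directly. Here is a minimal sketch (assuming the default token_pattern, which only matches runs of two or more word characters) that uses build_analyzer() to show how each string is split into terms:
from sklearn.feature_extraction.text import TfidfVectorizer
# The default word-level analyzer treats the hyphen as a separator,
# so 'abcdef-gh' is split into two tokens.
analyze = TfidfVectorizer().build_analyzer()
print(analyze('abcdefgh'))   # -> ['abcdefgh']
print(analyze('abcdef-gh'))  # -> ['abcdef', 'gh']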
However, you can change the analyzer of your Vectorizer from word to character level like so:
str1 = 'abcdefgh'
str2 = 'abcdef-gh'
vectorizer = TfidfVectorizer(analyzer="char")
tfidf_matrix = vectorizer.fit_transform([str1, str2])
similarity_matrix = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])
similarity_score = similarity_matrix[0][0]
print(similarity_score)
-> 0.8955324150715728
This result might be closer to what you expected.
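If single characters feel too coarse (any reordering of the same characters would also score highly), character n-grams via the ngram_range parameter are worth trying. The snippet below is a minimal sketch along those lines, reusing the same two strings:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Character 2- and 3-grams keep some local ordering information
# instead of counting individual characters only.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
tfidf_matrix = vectorizer.fit_transform(['abcdefgh', 'abcdef-gh'])
print(cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0][0])
There is also analyzer="char_wb", which builds character n-grams only from text inside word boundaries.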