I have two strings that differ only slightly:
str1 = 'abcdefgh'
str2 = 'abcdef-gh'
The only difference is that the second string contains a hyphen, yet tf-idf gives a similarity of 0.
The code to compute the tf-idf similarity is below:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def compute_cosine_similarity(str1, str2):
    # Create a TF-IDF vectorizer
    vectorizer = TfidfVectorizer()
    # Compute the TF-IDF matrix for the two strings
    tfidf_matrix = vectorizer.fit_transform([str1, str2])
    # Compute the cosine similarity between the two TF-IDF vectors
    similarity_matrix = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])
    # Extract the similarity score from the matrix
    similarity_score = similarity_matrix[0][0]
    return similarity_score
similarity_score = compute_cosine_similarity(str1, str2)
But if I change to:
str1 = 'abcdef-gh'
str2 = 'abcdef-gh'
The similarity is 1. It seems that tf-idf doesn't handle special symbols such as '-' well when they appear in only one of the strings.
Why is that?
If you examine the vocabulary of your fitted vectorizer instance, the scoring makes sense.
print(vectorizer.vocabulary_)
-> {'abcdefgh': 1, 'abcdef': 0, 'gh': 2}
And, in addition, the document-term matrix:
print(tfidf_matrix.toarray())
-> array([[0.        , 1.        , 0.        ],
          [0.70710678, 0.        , 0.70710678]])
The default word analyzer tokenizes on non-word characters, so the hyphen splits str2 into two tokens. Document str1 therefore consists of the single word 'abcdefgh', while document str2 consists of the two different words 'abcdef' and 'gh'. Because the two document vectors share no terms, their cosine similarity is 0.
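You can see this tokenization directly. Here is a minimal sketch (assuming the default token_pattern, which only matches runs of two or more word characters) that uses build_analyzer() to show how each string is split into terms:
from sklearn.feature_extraction.text import TfidfVectorizer
# The default word-level analyzer treats the hyphen as a separator,
# so 'abcdef-gh' is split into two tokens.
analyze = TfidfVectorizer().build_analyzer()
print(analyze('abcdefgh'))   # -> ['abcdefgh']
print(analyze('abcdef-gh'))  # -> ['abcdef', 'gh']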
However, you can change the analyzer of your Vectorizer from word to character level like so:
str1 = 'abcdefgh'
str2 = 'abcdef-gh'
vectorizer = TfidfVectorizer(analyzer="char")
tfidf_matrix = vectorizer.fit_transform([str1, str2])
similarity_matrix = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])
similarity_score = similarity_matrix[0][0]
print(similarity_score)
-> 0.8955324150715728
This result might be closer to what you expected.
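If single characters feel too coarse (any reordering of the same characters would also score highly), character n-grams via the ngram_range parameter are worth trying. The snippet below is a minimal sketch along those lines, reusing the same two strings:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Character 2- and 3-grams keep some local ordering information
# instead of counting individual characters only.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
tfidf_matrix = vectorizer.fit_transform(['abcdefgh', 'abcdef-gh'])
print(cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0][0])
There is also analyzer="char_wb", which builds character n-grams only from text inside word boundaries.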