I have a dataframe with 1000 rows of text.
I applied TfidfVectorizer to it.
Now I want to create a new column, e.g. df['king'], that gives the distance from each sentence to a word of my choice, let's say the word "king".
I thought about taking, in each sentence, the 5 words closest to the word "king" and averaging their distances.
I would be glad to know how to do that, or to hear about another method.
I am not convinced that Euclidean distance is the best measure here. I would look at cosine similarity scores instead:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

data = {
    'text': [
        "The king sat on the throne with wisdom.",
        "A queen ruled the kingdom alongside the king.",
        "Knights were loyal to their king.",
        "The empire prospered under the rule of a wise monarch."
    ]
}
df = pd.DataFrame(data)

# Fit TF-IDF on the sentences; each row of tfidf_matrix is one sentence.
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df['text'])

# transform() does not raise for an unknown word, it returns an all-zero
# vector, so check the vocabulary explicitly.
if "king" not in tfidf.vocabulary_:
    print("The word 'king' is not in the vocabulary.")
king_vector = tfidf.transform(["king"]).toarray()

# Cosine similarity between every sentence and the word "king".
similarities = cosine_similarity(tfidf_matrix, king_vector).flatten()

feature_names = np.array(tfidf.get_feature_names_out())

def get_top_n_words(row_vector, top_n=5):
    # Words with the top_n highest TF-IDF weights in this sentence.
    indices = row_vector.argsort()[::-1][:top_n]
    return feature_names[indices]

# For each sentence, average the similarity to "king" of its top-weighted words.
averages = []
for i in range(tfidf_matrix.shape[0]):
    sentence_vector = tfidf_matrix[i].toarray().flatten()
    top_words = get_top_n_words(sentence_vector)
    top_similarities = [cosine_similarity(tfidf.transform([word]), king_vector).flatten()[0] for word in top_words]
    averages.append(np.mean(top_similarities))

df['king_similarity'] = similarities
df['avg_closest_similarity'] = averages
print(df)
which would give you
                                                text  king_similarity  \
0            The king sat on the throne with wisdom.         0.240614
1      A queen ruled the kingdom alongside the king.         0.259779
2                  Knights were loyal to their king.         0.274487
3  The empire prospered under the rule of a wise ...         0.000000

   avg_closest_similarity
0                     0.0
1                     0.0
2                     0.0
3                     0.0
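The zeros in avg_closest_similarity are expected rather than a bug, as far as I can tell. In a bag-of-words TF-IDF space every vocabulary word has its own axis, so two different words are always orthogonal and their cosine similarity is 0. A word's similarity to "king" is 1 only if the word is "king" itself, and here "king" never makes a sentence's top 5 because it occurs in three of the four sentences and therefore gets a low IDF weight. A minimal sketch of that point on the same toy sentences (the `words` list is only an illustrative choice of mine):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "The king sat on the throne with wisdom.",
    "A queen ruled the kingdom alongside the king.",
    "Knights were loyal to their king.",
    "The empire prospered under the rule of a wise monarch.",
]
tfidf = TfidfVectorizer().fit(texts)

# Illustrative choice of vocabulary words, not part of the code above.
words = ["king", "queen", "throne", "monarch"]

# Each single-word "document" becomes a unit vector along that word's own axis.
word_vectors = tfidf.transform(words)

# Pairwise cosine similarity between the words:
# 1 on the diagonal (a word with itself), 0 everywhere else.
word_sims = cosine_similarity(word_vectors)
print(np.round(word_sims, 3))
```

So "the 5 closest words to king" is not really meaningful with TF-IDF alone; the sentence-level king_similarity column is the part that carries information.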
That being said, if you absolutely want to focus on Euclidean distance, here is a method:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.spatial.distance import euclidean

data = {
    'text': [
        "The king sat on the throne with wisdom.",
        "A queen ruled the kingdom alongside the king.",
        "Knights were loyal to their king.",
        "The empire prospered under the rule of a wise monarch."
    ]
}
df = pd.DataFrame(data)

# Dense TF-IDF matrix: one row per sentence, one column per vocabulary word.
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df['text']).toarray()
feature_names = tfidf.get_feature_names_out()

# Represent "king" as a one-hot vector on its own vocabulary axis.
if "king" in feature_names:
    king_index = np.where(feature_names == "king")[0][0]
    king_vector = np.zeros_like(tfidf_matrix[0])
    king_vector[king_index] = 1
else:
    print("The word 'king' is not in the vocabulary.")
    king_vector = np.zeros_like(tfidf_matrix[0])

# Euclidean distance from each sentence vector to the "king" vector.
df['king_distance'] = [euclidean(sentence_vector, king_vector) for sentence_vector in tfidf_matrix]
print(df)
which gives
                                                text  king_distance
0            The king sat on the throne with wisdom.       1.232385
1      A queen ruled the kingdom alongside the king.       1.216734
2                  Knights were loyal to their king.       1.204586
3  The empire prospered under the rule of a wise ...       1.414214
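Note that with this setup the two measures carry the same information. TfidfVectorizer L2-normalises each row by default (norm="l2"), and the "king" vector also has length 1, so for these unit vectors the Euclidean distance d and the cosine similarity s should satisfy d = sqrt(2 - 2s). A small sketch that checks this on the toy data (the variable names are my own, and I use sklearn's euclidean_distances here instead of scipy only for brevity):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

texts = [
    "The king sat on the throne with wisdom.",
    "A queen ruled the kingdom alongside the king.",
    "Knights were loyal to their king.",
    "The empire prospered under the rule of a wise monarch.",
]

tfidf = TfidfVectorizer()  # default norm="l2": every non-empty row has length 1
X = tfidf.fit_transform(texts)
king = tfidf.transform(["king"])  # unit vector along the "king" axis

s = cosine_similarity(X, king).ravel()    # similarity of each sentence to "king"
d = euclidean_distances(X, king).ravel()  # distance of each sentence to "king"

# For unit vectors a and b: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.allclose(d, np.sqrt(2 - 2 * s)))
```

Ranking sentences by smallest distance or by largest similarity therefore gives the same order here. The relation no longer holds if you switch the normalisation off, for example with norm=None.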