I used TfidfVectorizer to compute TF-IDF, but I don't know how it calculates the results. When I calculate the values manually, I get a different answer, so I want to extract the values the vectorizer computes in order to learn how it works.
from sklearn.feature_extraction.text import TfidfVectorizer

data = ['Souvenir shop|Architecture and art|Culture and history', 'Souvenir shop|Resort|Diverse cuisine|Fishing|Folk games|Beautiful scenery', 'Diverse cuisine|Resort|Beautiful scenery']
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(data)
Have a look at the Attributes section of the scikit-learn documentation for TfidfVectorizer.
Try this:
print(vectorizer.vocabulary_)
Output
{'souvenir': 14,
'shop': 13,
'architecture': 1,
'and': 0,
'art': 2,
'culture': 5,
'history': 10,
'resort': 11,
'diverse': 6,
'cuisine': 4,
'fishing': 7,
'folk': 8,
'games': 9,
'beautiful': 3,
'scenery': 12}
You get the IDF weights from vectorizer.idf_. They are ordered by the column indices shown in vocabulary_:
Output
array([1.69314718, 1.69314718, 1.69314718, 1.28768207, 1.28768207,
1.69314718, 1.28768207, 1.69314718, 1.69314718, 1.69314718,
1.69314718, 1.28768207, 1.28768207, 1.28768207, 1.28768207])
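A likely reason your manual numbers differ is that, with the default smooth_idf=True, scikit-learn uses idf(t) = ln((1 + n) / (1 + df(t))) + 1 rather than the textbook ln(n / df(t)). Here is a rough sketch that reproduces idf_ by hand, assuming default settings (n_docs, doc_freq and manual_idf are just names chosen for this example):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

data = ['Souvenir shop|Architecture and art|Culture and history',
        'Souvenir shop|Resort|Diverse cuisine|Fishing|Folk games|Beautiful scenery',
        'Diverse cuisine|Resort|Beautiful scenery']
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(data)

n_docs = tfidf_matrix.shape[0]

# Document frequency: in how many documents does each term occur?
doc_freq = (tfidf_matrix.toarray() > 0).sum(axis=0)

# Smoothed IDF as used by default (smooth_idf=True): ln((1 + n) / (1 + df)) + 1
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(np.allclose(manual_idf, vectorizer.idf_))  # True
```

For example, a term that appears in one of the three documents gets ln(4 / 2) + 1 ≈ 1.6931, and a term that appears in two gets ln(4 / 3) + 1 ≈ 1.2877, which matches the output above.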
For your case you can do this (with pandas):
import pandas as pd

df_idf = pd.DataFrame(
    vectorizer.idf_, index=vectorizer.get_feature_names_out(), columns=["idf_weights"]
)
display(df_idf)  # in a notebook; use print(df_idf) in a plain script
Output
              idf_weights
and              1.693147
architecture     1.693147
art              1.693147
beautiful        1.287682
cuisine          1.287682
culture          1.693147
diverse          1.287682
fishing          1.693147
folk             1.693147
games            1.693147
history          1.693147
resort           1.287682
scenery          1.287682
shop             1.287682
souvenir         1.287682
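The IDF weights are only part of the result. With default settings, each row of tfidf_matrix is the raw term count times the IDF weight, then scaled to unit length (norm='l2'). That normalization is another common source of mismatch with hand calculations. A sketch that rebuilds the whole matrix, assuming defaults and using CountVectorizer for the raw counts (counts, raw and manual_tfidf are illustrative names):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

data = ['Souvenir shop|Architecture and art|Culture and history',
        'Souvenir shop|Resort|Diverse cuisine|Fishing|Folk games|Beautiful scenery',
        'Diverse cuisine|Resort|Beautiful scenery']
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(data)

# Raw term counts; CountVectorizer tokenizes the same way by default,
# so its columns line up with those of the TfidfVectorizer.
counts = CountVectorizer().fit_transform(data).toarray()

# Step 1: term frequency times IDF.
raw = counts * vectorizer.idf_

# Step 2: L2-normalize each document (row) to unit length.
manual_tfidf = raw / np.linalg.norm(raw, axis=1, keepdims=True)

print(np.allclose(manual_tfidf, tfidf_matrix.toarray()))  # True
```

If you want to compare against unnormalized values instead, you could create the vectorizer with norm=None and inspect the matrix again.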