python tf-idf tfidfvectorizer

How to extract the values TfidfVectorizer calculates for TF-IDF


I used TfidfVectorizer to compute TF-IDF, but I don't know how it calculates the results. When I calculate the values manually, I get a different answer, so I want to extract the values that the function computes in order to learn how it works.

from sklearn.feature_extraction.text import TfidfVectorizer

data = ['Souvenir shop|Architecture and art|Culture and history', 'Souvenir shop|Resort|Diverse cuisine|Fishing|Folk games|Beautiful scenery', 'Diverse cuisine|Resort|Beautiful scenery']

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(data)

Solution

  • Have a look at the Attributes section of the scikit-learn documentation for TfidfVectorizer.

    The vocabulary_ attribute maps each term to its column index in the matrix:

    print(vectorizer.vocabulary_)
    

    Output

    {'souvenir': 14,
     'shop': 13,
     'architecture': 1,
     'and': 0,
     'art': 2,
     'culture': 5,
     'history': 10,
     'resort': 11,
     'diverse': 6,
     'cuisine': 4,
     'fishing': 7,
     'folk': 8,
     'games': 9,
     'beautiful': 3,
     'scenery': 12}
    

    The idf weights, ordered by column index, are available with print(vectorizer.idf_)

    Output

    array([1.69314718, 1.69314718, 1.69314718, 1.28768207, 1.28768207,
           1.69314718, 1.28768207, 1.69314718, 1.69314718, 1.69314718,
           1.69314718, 1.28768207, 1.28768207, 1.28768207, 1.28768207])
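
    These numbers are consistent with the smoothed idf formula described in the scikit-learn documentation for the default setting smooth_idf=True, where n is the number of documents and df(t) is the number of documents containing term t. A worked check, assuming the defaults were not changed ("and" occurs in 1 of the 3 documents, "resort" in 2 of them):

    ```latex
    \mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1

    \mathrm{idf}(\text{and}) = \ln\frac{1 + 3}{1 + 1} + 1 = \ln 2 + 1 \approx 1.6931

    \mathrm{idf}(\text{resort}) = \ln\frac{1 + 3}{1 + 2} + 1 = \ln\tfrac{4}{3} + 1 \approx 1.2877
    ```

    A manual calculation based on the textbook form ln(n / df(t)) gives different values, which may explain the mismatch you saw.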
    

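    For your case, you can label each idf weight with its term (with pandas):

    import pandas as pd

    df_idf = pd.DataFrame(
        vectorizer.idf_, index=vectorizer.get_feature_names_out(), columns=["idf_weights"]
    )
    
    # display() is available in Jupyter/IPython; use print(df_idf) elsewhere
    display(df_idf)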
    

    Output

                 idf_weights
    and          1.693147
    architecture 1.693147
    art          1.693147
    beautiful    1.287682
    cuisine      1.287682
    culture      1.693147
    diverse      1.287682
    fishing      1.693147
    folk         1.693147
    games        1.693147
    history      1.693147
    resort       1.287682
    scenery      1.287682
    shop         1.287682
    souvenir     1.287682
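
To see how the final matrix values come about, here is a minimal sketch that rebuilds them by hand and compares them with the TfidfVectorizer output. It assumes the default settings (smooth_idf=True, norm="l2", sublinear_tf=False); the names counts, doc_freq, idf_manual, weighted and tfidf_manual are only illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

data = [
    "Souvenir shop|Architecture and art|Culture and history",
    "Souvenir shop|Resort|Diverse cuisine|Fishing|Folk games|Beautiful scenery",
    "Diverse cuisine|Resort|Beautiful scenery",
]

# Reference result from scikit-learn with default settings.
vectorizer = TfidfVectorizer()
tfidf_sklearn = vectorizer.fit_transform(data).toarray()

# Step 1: raw term counts (tf), using the same default tokenization.
counts = CountVectorizer().fit_transform(data).toarray()

# Step 2: smoothed idf, idf(t) = ln((1 + n) / (1 + df(t))) + 1.
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term
idf_manual = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Step 3: multiply tf by idf, then scale each row to unit L2 norm.
weighted = counts * idf_manual
tfidf_manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare the manual values with what scikit-learn computed.
print(np.allclose(idf_manual, vectorizer.idf_))
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

If both checks print True, the manual steps match what the vectorizer does under these defaults. The row normalization in step 3 is easy to miss when calculating by hand: without it, the values are tf times idf and will not match tfidf_matrix.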