python tf-idf

How to get the value in TFIDF transformer?


I'm new to Python and have recently been learning text processing with Bag of Words and TFIDF.

I was trying to get the word in column 1001 of my TFIDF matrix using the following code:

count_vectorizer = CountVectorizer()
bag_of_words = count_vectorizer.fit_transform(df)

TFIDF_transformer = TfidfTransformer(norm = 'l2')
TFIDF_representation = TFIDF_transformer.fit_transform(bag_of_words)

TFIDF_transformer.get_feature_names_out()[1000]

and the output is "x1000", a token (I assume) instead of a word.

How can I get the exact word in column 1001 in my TFIDF? Am I using the wrong function or missing other steps to interpret the token I get?


Solution

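  • CountVectorizer returns a sparse matrix, which has no column names, so TfidfTransformer never sees the words and falls back to placeholder names such as "x1000". The words themselves are stored in the CountVectorizer. One fix is to convert the matrix to a DataFrame and use the words from the CountVectorizer as its column names: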

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    
    count_vectorizer = CountVectorizer()
    bag_of_words = count_vectorizer.fit_transform(df)
    
    ### Turn sparse array into dense pandas dataframe and add column names (words/tokens)
    bag_of_words = pd.DataFrame(bag_of_words.toarray(), columns=count_vectorizer.get_feature_names_out())
    
    TFIDF_transformer = TfidfTransformer(norm = 'l2')
    TFIDF_representation = TFIDF_transformer.fit_transform(bag_of_words)
    
    ### The transformer was fitted on named columns, so this now returns a word
    TFIDF_transformer.get_feature_names_out()[1000]
    

    Alternatively, if all you need is the TF-IDF representation, it is probably simpler to use TfidfVectorizer directly instead of combining CountVectorizer with TfidfTransformer:

    from sklearn.feature_extraction.text import TfidfVectorizer
    
    TFIDF = TfidfVectorizer()
    TFIDF_representation = TFIDF.fit_transform(df)
    
    TFIDF.get_feature_names_out()[1000]