I'm new to Python and have recently been learning text processing with Bag of Words and TF-IDF.
I was trying to get the word in column 1001 of my TF-IDF matrix using the following code:
count_vectorizer = CountVectorizer()
bag_of_words = count_vectorizer.fit_transform(df)
TFIDF_transformer = TfidfTransformer(norm = 'l2')
TFIDF_representation = TFIDF_transformer.fit_transform(bag_of_words)
TFIDF_transformer.get_feature_names_out()[1000]
and the output is "x1000", a token (I assume) instead of a word.
How can I get the exact word in column 1001 in my TFIDF? Am I using the wrong function or missing other steps to interpret the token I get?
The count vectorizer returns a sparse matrix, which doesn't have column names, so the TfidfTransformer never sees the words and falls back to generated names such as "x1000". You need to convert the matrix to a dataframe and add the words as the column names by pulling them out of the CountVectorizer:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
count_vectorizer = CountVectorizer()
bag_of_words = count_vectorizer.fit_transform(df)
### Turn sparse array into dense pandas dataframe and add column names (words/tokens)
bag_of_words = pd.DataFrame(bag_of_words.toarray(), columns=count_vectorizer.get_feature_names_out())
TFIDF_transformer = TfidfTransformer(norm = 'l2')
TFIDF_representation = TFIDF_transformer.fit_transform(bag_of_words)
Alternatively, if you're just after TF-IDF vectorization, it will probably be simpler to use the TfidfVectorizer directly instead of the TfidfTransformer:
from sklearn.feature_extraction.text import TfidfVectorizer
TFIDF = TfidfVectorizer()
TFIDF_representation = TFIDF.fit_transform(df)
TFIDF.get_feature_names_out()