I am trying to cluster documents by keywords. I'm using the following code to build a tf-idf matrix:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=.8, max_features=1000,
                                   min_df=0.07, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1,2))
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print(tfidf_matrix.shape) returns (567, 209), meaning there are 567 documents, each of which has some mixture of the 209 feature terms detected by the scikit-learn TfidfVectorizer.
Next, I used terms = tfidf_vectorizer.get_feature_names() to get a list of the terms. Running print(len(terms)) gives 209.
Many of these terms are unnecessary for the task, and they add noise to the clustering. I have gone through the list by hand and extracted the meaningful feature names, resulting in a new terms list. Now, running print(len(terms)) gives 67.
However, running tfidf_vectorizer.fit_transform(documents) still gives a shape of (567, 209), which means the fit_transform(documents) call is still using the noisy list of 209 terms rather than the hand-selected list of 67 terms.
How can I get tfidf_vectorizer.fit_transform(documents) to run using the list of 67 hand-selected terms? I suspect this might require adding at least one function to the scikit-learn package on my machine. Is that correct?
Any help is greatly appreciated. Thanks!
There are two ways:
If you have identified a list of stopwords (you called them "unnecessary for the task"), just pass them to the stop_words parameter of the TfidfVectorizer so that they are ignored when the bag of words is built.
Note, however, that the predefined English stopwords won't be used any more if you set the stop_words parameter to your custom list. If you want to combine the predefined English list with your additional stopwords, just add the two lists:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
stop_words = list(ENGLISH_STOP_WORDS) + ['your','additional', 'stopwords']
tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words) # add your other params here
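To check that the combined list behaves as expected, here is a minimal sketch on a toy corpus. The docs list and the extra_stop_words entries are made up for illustration and are not taken from the question; only the default tokenizer is used.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Hypothetical toy corpus standing in for the real documents.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog and a cat played",
]

# Hypothetical extra stopwords, added on top of the built-in English list.
extra_stop_words = ["sat", "played"]
stop_words = list(ENGLISH_STOP_WORDS) + extra_stop_words

vectorizer = TfidfVectorizer(stop_words=stop_words)
matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps each kept term to its column index.
kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)    # ['cat', 'chased', 'dog', 'mat']
print(matrix.shape)  # (3, 4)
```

One caveat if you keep a stemming tokenizer such as tokenize_and_stem: stopwords are matched against the tokens the tokenizer produces, so the extra entries probably need to be in stemmed form as well.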
If you have a fixed vocabulary and only want these words to be counted (i.e. your terms list), just set the vocabulary parameter of TfidfVectorizer:
tfidf_vectorizer = TfidfVectorizer(vocabulary=terms) # add your other params here
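As a rough illustration of the effect on the matrix shape, here is a small sketch with a toy corpus and a two-entry terms list. Both are invented for the example; in your case terms would be the 67 hand-selected entries.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the real documents.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog and a cat played",
]

# Hypothetical hand-selected vocabulary; only these terms become columns.
terms = ["cat", "dog"]

vectorizer = TfidfVectorizer(vocabulary=terms)
matrix = vectorizer.fit_transform(docs)

# One column per entry in terms, in the order given.
print(matrix.shape)            # (3, 2)
print(vectorizer.vocabulary_)  # {'cat': 0, 'dog': 1}
```

With a fixed vocabulary, max_df, min_df and max_features are ignored, so they can be dropped. The tokenizer and ngram_range still matter: the entries in terms have to match what the tokenizer emits (stemmed forms, in your setup), and two-word terms are only counted if ngram_range includes 2.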