I have a dataset that I'm trying to cluster. Although I set min_df and max_df in the TfidfVectorizer, the output MiniBatchKMeans returns contains words that, according to the documentation, the vectorizer should eliminate, because they are present in at least one other document (max_df=1.).
The tfidf settings:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

min_df = 5
max_df = 1.

vectorizer = TfidfVectorizer(stop_words='english', min_df=min_df,
                             max_df=max_df, max_features=100000)  # Corpus is in English
c_vectorizer = CountVectorizer(stop_words='english', min_df=min_df,
                               max_df=max_df, max_features=100000)  # Corpus is in English

X = vectorizer.fit_transform(dataset)
C_X = c_vectorizer.fit_transform(dataset)
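The top terms per cluster are printed with code along these lines (a simplified sketch, not my exact code; n_clusters and the other MiniBatchKMeans parameters are assumed):

from sklearn.cluster import MiniBatchKMeans

km = MiniBatchKMeans(n_clusters=3, random_state=0).fit(X)

# Each cluster center is a vector of per-term weights; sorting it in
# descending order gives the most characteristic terms of that cluster.
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()
for i in range(km.n_clusters):
    print(f"Topic{i}:", " ".join(terms[j] for j in order_centroids[i, :30]))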
The output of MiniBatchKMeans:
Topic0: information book history read good great lot author write
useful use recommend need time make know provide like easy
excellent just learn look work want help reference buy guide
interested
Topic1: book read good great use make write buy time work like
just recommend know look year need author want think help new life
way love people really excellent easy say
Topic2: story novel character book life read love time write make
like reader great end woman world good man work plot way people
just family know come young author think year
As you can see, "book" is in all 3 topics, but with max_df=1. shouldn't it be deleted?
From the TfidfVectorizer documentation:
max_df : float or int, default=1.0

When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float in range [0.0, 1.0], the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
So the max_df in the question is set to the default value: as a float, 1.0 means "ignore terms with a document frequency strictly higher than 100% of documents", which can never happen, so nothing is removed.
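The float/int distinction is easy to see on a toy corpus (a minimal sketch; the documents are made up):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["books cats", "books dogs", "books birds"]

# max_df=1.0 (float): a proportion -- keep terms appearing in up to 100% of docs
print(CountVectorizer(max_df=1.0).fit(docs).get_feature_names_out())
# ['birds' 'books' 'cats' 'dogs']

# max_df=1 (int): an absolute count -- drop terms appearing in more than 1 doc
print(CountVectorizer(max_df=1).fit(docs).get_feature_names_out())
# ['birds' 'cats' 'dogs']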
You probably want something like: "Remove words that occur in more than 99% of documents":
from sklearn.feature_extraction.text import TfidfVectorizer

raw_data = [
    "books cats coffee",
    "books cats",
    "books and coffee and coffee",
    "books and words and coffee",
]

# 'books' appears in 100% of the documents (> 99%), so it is dropped;
# 'and' is removed by the English stop-word list.
tfidf = TfidfVectorizer(stop_words="english", max_df=0.99)
X = tfidf.fit_transform(raw_data)
print(tfidf.get_feature_names_out())
print(X.todense())
['cats' 'coffee' 'words']
[[0.77722116 0.62922751 0. ]
[1. 0. 0. ]
[0. 1. 0. ]
[0. 0.53802897 0.84292635]]
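To confirm which terms cross the threshold, you can inspect the document frequencies directly (a small sketch reusing raw_data from above):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words="english")
counts = cv.fit_transform(raw_data)

# Fraction of documents each term appears in
df = np.asarray((counts > 0).sum(axis=0)).ravel() / counts.shape[0]
for term, freq in zip(cv.get_feature_names_out(), df):
    print(f"{term}: {freq:.0%}")
# books: 100%   <- strictly above the 99% cutoff, hence removed
# cats: 50%
# coffee: 75%
# words: 25%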
If you really do want to remove every word that is present in at least one other document, pass an integer: max_df=1 means "ignore terms that appear in strictly more than 1 document". Every surviving term then occurs in exactly one document, so the IDF weights are all identical and a plain CountVectorizer is the better approach:
from sklearn.feature_extraction.text import CountVectorizer

raw_data = [
    "unique books cats coffee",
    "case books cats",
    "for books and words coffee and coffee",
    "each books and words and coffee",
]

# Integer max_df: drop any term that appears in more than 1 document
cv = CountVectorizer(max_df=1)
X = cv.fit_transform(raw_data)
print(cv.get_feature_names_out())
print(X.todense())
['case' 'each' 'for' 'unique']
[[0 0 0 1]
[1 0 0 0]
[0 0 1 0]
[0 1 0 0]]
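For comparison, TfidfVectorizer accepts an integer max_df as well and yields the same vocabulary; since every surviving term appears in exactly one document, all IDF weights are equal (and each row has a single nonzero entry that l2-normalizes to 1.0), so TF-IDF weighting adds nothing here:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_df=1)
X_tfidf = tfidf.fit_transform(raw_data)
print(tfidf.get_feature_names_out())
# ['case' 'each' 'for' 'unique'] -- same vocabulary as the CountVectorizer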