scikit-learn, data-science, k-means, tfidfvectorizer

TfidfVectorizer does not eliminate words that occur in more than one document


I have a dataset that I'm trying to cluster. Although I set min_df and max_df in the TfidfVectorizer, the output MiniBatchKMeans returns to me contains words that, according to the vectorizer's documentation, should be eliminated because they are present in at least one other document (max_df=1.).

The tfidf settings:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

min_df = 5
max_df = 1.
vectorizer = TfidfVectorizer(stop_words='english', min_df=min_df,
                             max_df=max_df, max_features=100000)  # corpus is in English
c_vectorizer = CountVectorizer(stop_words='english', min_df=min_df,
                               max_df=max_df, max_features=100000)  # corpus is in English
X = vectorizer.fit_transform(dataset)
C_X = c_vectorizer.fit_transform(dataset)

The output of MiniBatchKMeans:

Topic0: information book history read good great lot author write    
useful use recommend need time make know provide like easy   
excellent just learn look work want help reference buy guide 
interested
Topic1: book read good great use make write buy time work like   
just recommend know look year need author want think help new life 
way love people really excellent easy say
Topic2: story novel character book life read love time write make   
like reader great end woman world good man work plot way people  
just family know come young author think year

As you can see, "book" appears in all 3 topics, but with max_df=1. shouldn't it have been eliminated?


Solution

  • From the TfidfVectorizer documentation:

    max_df: float or int, default=1.0

    When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float in range [0.0, 1.0], the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

    So the max_df in the question is set to the default value: as a float, 1. is read as a proportion of documents, so max_df=1. means "ignore terms that appear in more than 100% of documents", which filters out nothing.
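    To make the float-vs-int distinction concrete, here is a minimal sketch (the three-document corpus is made up for illustration):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Made-up corpus: "book" appears in every document.
    docs = ["book history", "book novel", "book story"]

    # Float: ignore terms in more than 100% of documents -> nothing is filtered.
    print(TfidfVectorizer(max_df=1.).fit(docs).get_feature_names_out())
    # ['book' 'history' 'novel' 'story']

    # Int: ignore terms that appear in more than 1 document -> "book" is dropped.
    print(TfidfVectorizer(max_df=1).fit(docs).get_feature_names_out())
    # ['history' 'novel' 'story']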

    You probably want something like "remove words that occur in more than 99% of documents". With the 4-document toy corpus below, the cutoff is 0.99 × 4 = 3.96 documents, so "books", which appears in all 4, is dropped:

    from sklearn.feature_extraction.text import TfidfVectorizer
    
    raw_data = [
        "books cats coffee",
        "books cats",
        "books and coffee and coffee",
        "books and words and coffee",
    ]
    
    tfidf = TfidfVectorizer(stop_words="english", max_df=0.99)
    X = tfidf.fit_transform(raw_data)
    
    print(tfidf.get_feature_names_out())
    print(X.todense())
    
    ['cats' 'coffee' 'words']
    [[0.77722116 0.62922751 0.        ]
     [1.         0.         0.        ]
     [0.         1.         0.        ]
     [0.         0.53802897 0.84292635]]
    

    If you really do want to remove every word that is present in at least one other document, pass max_df as an integer so it is read as an absolute document count. Here with a CountVectorizer, whose raw counts make the result easy to read:

    from sklearn.feature_extraction.text import CountVectorizer
    
    raw_data = [
        "unique books cats coffee",
        "case books cats",
        "for books and words coffee and coffee",
        "each books and words and coffee",
    ]
    
    count_vec = CountVectorizer(max_df=1)
    X = count_vec.fit_transform(raw_data)

    print(count_vec.get_feature_names_out())
    print(X.todense())
    
    ['case' 'each' 'for' 'unique']
    [[0 0 0 1]
     [1 0 0 0]
     [0 0 1 0]
     [0 1 0 0]]
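
    The same integer threshold works with TfidfVectorizer too, since it builds its vocabulary the same way as CountVectorizer. A sketch, if you still want tf-idf weights on the surviving terms:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Same raw_data as above; integer max_df keeps only terms
    # that appear in at most one document.
    tfidf = TfidfVectorizer(max_df=1)
    print(tfidf.fit(raw_data).get_feature_names_out())
    # ['case' 'each' 'for' 'unique']

    Keep in mind, though, that the min_df=5 from the question cannot be combined with max_df=1: no term can appear in at least 5 documents and at most 1, so scikit-learn raises a ValueError.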