nlp topic-modeling tfidfvectorizer

How can I solve the error: The 'stop_words' parameter of TfidfVectorizer must be a str among {'english'}, an instance of 'list' or None?


I am trying to do topic modeling (with German stop words and German text), following the explanation in: Albrecht, Jens, Sidharth Ramachandran, and Christian Winkler. Blueprints for text analysis using Python: machine learning-based solutions for common real world (NLP) applications. First edition. Sebastopol, CA: O’Reilly Media, Inc., 2020, p. 209 ff.

# Load Data
import pandas as pd
# Load the Excel file
xlsx = pd.ExcelFile("Priorisierung_der_Anforderungen.xlsx")
df = pd.read_excel(xlsx)

# Convert the Anforderungsbeschreibung (requirement description) column to string
df = df.astype({'Anforderungsbeschreibung': 'string'})
df.info()

# "Ignore spaces after the stop..."
import re
df["paragraphs"] = df["Anforderungsbeschreibung"].map(lambda text: re.split(r'\.\s*\n', text))
df["number_of_paragraphs"] = df["paragraphs"].map(len)

%matplotlib inline
df.groupby('Title').agg({'number_of_paragraphs': 'mean'}).plot.bar(figsize=(24,12))


# Preparations
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.de.stop_words import STOP_WORDS as stopwords

tfidf_text_vectorizer = TfidfVectorizer(stop_words=stopwords, min_df=5, max_df=0.7)
tfidf_text_vectors = tfidf_text_vectorizer.fit_transform(df['Anforderungsbeschreibung'])
tfidf_text_vectors.shape

I receive this error message:

InvalidParameterError: The 'stop_words' parameter of TfidfVectorizer must be a str among {'english'}, an instance of 'list' or None.

InvalidParameterError                     Traceback (most recent call last)
Cell In[8], line 4
  1 #tfidf_text_vectorizer = = TfidfVectorizer(stop_words=stopwords.words('german'),)
  3 tfidf_text_vectorizer = TfidfVectorizer(stop_words=stopwords, min_df=5, max_df=0.7)
----> 4 tfidf_text_vectors = tfidf_text_vectorizer.fit_transform(df['Anforderungsbeschreibung'])
  5 tfidf_text_vectors.shape

InvalidParameterError: The 'stop_words' parameter of TfidfVectorizer must be a str among {'english'}, an instance of 'list' or None.

Thank you for any tips. Sebastian


Solution

  • The stop words you've imported from spaCy are a set, not a list, and TfidfVectorizer accepts only 'english', a list, or None for stop_words.

    from spacy.lang.de.stop_words import STOP_WORDS
    
    type(STOP_WORDS)
    

    [out]:

    set
    

    Convert the stop words to a list and it should work as expected.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from spacy.lang.de.stop_words import STOP_WORDS
    
    
    tfidf_text_vectorizer = TfidfVectorizer(stop_words=list(STOP_WORDS))