I am trying to do topic modeling (with German stop words and German text), following the explanation in: Albrecht, Jens, Sidharth Ramachandran, and Christian Winkler. Blueprints for text analysis using Python: machine learning-based solutions for common real world (NLP) applications. First edition. Sebastopol, CA: O’Reilly Media, Inc., 2020, pp. 209 ff.
# Load Data
import pandas as pd
# Load the Excel file with pandas
xlsx = pd.ExcelFile("Priorisierung_der_Anforderungen.xlsx")
df = pd.read_excel(xlsx)
# Convert the Anforderungsbeschreibung column to string
df = df.astype({'Anforderungsbeschreibung': 'string'})
df.info()
# "Ignore spaces after the stop..."
import re
df["paragraphs"] = df["Anforderungsbeschreibung"].map(lambda text: re.split(r'\.\s*\n', text))
df["number_of_paragraphs"] = df["paragraphs"].map(len)
%matplotlib inline
df.groupby('Title').agg({'number_of_paragraphs': 'mean'}).plot.bar(figsize=(24,12))
# Preparations
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.de.stop_words import STOP_WORDS as stopwords
tfidf_text_vectorizer = TfidfVectorizer(stop_words=stopwords, min_df=5, max_df=0.7)
tfidf_text_vectors = tfidf_text_vectorizer.fit_transform(df['Anforderungsbeschreibung'])
tfidf_text_vectors.shape
I receive this error message:
InvalidParameterError: The 'stop_words' parameter of TfidfVectorizer must be a str among {'english'}, an instance of 'list' or None.
InvalidParameterError Traceback (most recent call last)
Cell In[8], line 4
1 #tfidf_text_vectorizer = = TfidfVectorizer(stop_words=stopwords.words('german'),)
3 tfidf_text_vectorizer = TfidfVectorizer(stop_words=stopwords, min_df=5, max_df=0.7)
----> 4 tfidf_text_vectors = tfidf_text_vectorizer.fit_transform(df['Anforderungsbeschreibung'])
5 tfidf_text_vectors.shape
InvalidParameterError: The 'stop_words' parameter of TfidfVectorizer must be a str among {'english'}, an instance of 'list' or None.
Thank you for any tips. Sebastian
The stop words you've imported from spaCy are a set, not a list, and TfidfVectorizer only accepts 'english', a list, or None for stop_words.
from spacy.lang.de.stop_words import STOP_WORDS
type(STOP_WORDS)
[out]:
set
Convert the stop words to a list and it should work as expected.
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.de.stop_words import STOP_WORDS
tfidf_text_vectorizer = TfidfVectorizer(stop_words=list(STOP_WORDS))