Tags: python, nlp, bert-language-model, topic-modeling

Removal of Stop Words and Stemming/Lemmatization for BERTopic


For topic modelling, I'm trying out BERTopic: Link

I'm a little confused here. I am trying out BERTopic on my custom dataset.
Since BERT was trained in a way that preserves the semantic meaning of the text/document, should I remove stop words and stem/lemmatize my documents before passing them to BERTopic? I'm afraid these stop words might otherwise land in my topics as salient terms, which they are not.

Suggestions and advice, please!


Solution

  • No.

    BERTopic uses transformer models that were trained on "real", clean text, not on text stripped of stop words, stems, or lemmas. By the end of the computation the stop words have become noise (non-informative) and all end up in topic_id = -1, the outlier topic (the first sketch after this answer illustrates this).

    For the same reason you should not tokenize the text (that is done internally) or lemmatize it (a somewhat subjective transformation). That will mess up your topics.

    A disadvantage of not lemmatizing is that a topic's keywords carry a lot of redundancy, e.g. (among the top 10 terms) "hotel, hotels", "resort, resorts", etc. It also does not handle bigrams like "New York" or "Barack Obama" elegantly; the second sketch after this answer shows one way to mitigate both.

    You can't have it all ;-)

    Andreas

    PS: You can of course remove HTML tags; they are not in your reference corpus either.
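
To make the "feed it raw text" advice concrete, here is a minimal sketch of the standard BERTopic workflow. The 20 Newsgroups corpus is only a stand-in for a custom dataset, and the regex line is just an illustration of the HTML cleanup mentioned in the PS:

```python
import re

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Raw documents; 20 Newsgroups stands in for your own dataset here.
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# The only cleanup: strip HTML-like tags, since such markup was not in the
# transformer's training corpus either. No stop-word removal, no lemmatizing.
docs = [re.sub(r"<[^>]+>", " ", d) for d in docs]

# Fit on the raw text; tokenization for the topic representation is done internally.
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Topic -1 is the outlier topic where the non-informative,
# stop-word-heavy material tends to collect.
print(topic_model.get_topic_info().head())
print(topic_model.get_topic(-1)[:10])  # top terms of the outlier topic
```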
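
If stop words or split-up bigrams still bother you in the topic keywords, BERTopic accepts a custom scikit-learn CountVectorizer that is used only for the c-TF-IDF topic representation, so the embeddings are still computed on the raw, unmodified text. A sketch, reusing `docs` from the previous example:

```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# The vectorizer only shapes the keyword representation of each topic;
# it does not touch the text that gets embedded.
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2))

topic_model = BERTopic(vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)

# An already-fitted model can be re-represented the same way, without refitting:
# topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
```

This keeps the "real and clean text" principle intact: only the keyword extraction step sees the stop-word filtering and the (1, 2) n-gram range, which also lets bigrams like "new york" surface as single keywords.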