For topic modelling, I'm trying out BERTopic: Link
I'm a little confused here. I am trying out BERTopic on my custom dataset.
Since BERT was trained in such a way that it captures the semantic meaning of the text/document,
should I remove the stop words and stem/lemmatize my documents before passing them on to BERTopic?
I'm afraid these stop words might land in my topics as salient terms, which they are not.
Suggestions and advice, please!
No.
BERTopic uses transformers that are trained on "real and clean" text, not on text stripped of stop words, lemmas or tokens. By the end of the calculation the stop words have become noise (non-informative) and mostly end up in topic_id = -1.
For the same reason you should not tokenize (that is done internally) or lemmatize (somewhat subjective) the text. That will mess up your topics.
A disadvantage of not lemmatizing is that the keywords of a topic contain a lot of redundancy, e.g. (top_n=10) "hotel, hotels", "resort, resorts", etc. It also does not handle bigrams like "New York" or "Barack Obama" elegantly.
You can't have it all ;-)
Andreas
PS: You can of course remove HTML tags; they are not in your reference corpus either.