I am running BERTopic with default options:
import pandas as pd
from sentence_transformers import SentenceTransformer
import time
import pickle
from bertopic import BERTopic
llm_mod = "all-MiniLM-L6-v2"
model = SentenceTransformer(llm_mod)
# skills_augmented is the list of ~40,000 short-sentence documents
embeddings = model.encode(skills_augmented, show_progress_bar=True)
bertopic_model = BERTopic(verbose=True)
topics, probs = bertopic_model.fit_transform(skills_augmented, embeddings)
I have a dataset of 40,000 documents, each of which is a single short sentence. 13,573 of the documents get placed in the -1 topic (distribution across the top 5 topics shown below).
-1 13573
0 1593
1 1043
2 628
3 627
From the documentation: "The -1 refers to all outliers and should typically be ignored." Is there a parameter I can use to get fewer documents in -1? Perhaps get a more even distribution across topics? Would running k-means be better?
From the documentation:
The main way to reduce your outliers in BERTopic is by using the .reduce_outliers function. To make it work without too much tweaking, you will only need to pass the docs and their corresponding topics. You can pass outlier and non-outlier documents together since it will only try to reduce outlier documents and label them to a non-outlier topic.
The following is a minimal example:
from bertopic import BERTopic
# Train your BERTopic model
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
# Reduce outliers
new_topics = topic_model.reduce_outliers(docs, topics)
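After reassigning the outlier documents, you will likely also want to refresh the topic representations so the c-TF-IDF keywords reflect the new assignments. A minimal sketch, assuming BERTopic's update_topics method (check the behaviour for your installed version):
# Recompute topic representations using the outlier-reduced assignments
topic_model.update_topics(docs, topics=new_topics)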
You can find all the strategies for reducing outliers on this page: Outlier reduction.
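As for running k-means: swapping HDBSCAN for k-means removes the -1 topic entirely, because every document is forced into a cluster, at the cost of fixing the number of topics up front and losing outlier detection. A rough sketch, assuming BERTopic's hdbscan_model parameter accepts a scikit-learn clustering model and reusing skills_augmented and embeddings from the question (n_clusters=50 is an arbitrary placeholder):
from bertopic import BERTopic
from sklearn.cluster import KMeans
# k-means assigns every document to a cluster, so no documents end up in -1
cluster_model = KMeans(n_clusters=50)
kmeans_topic_model = BERTopic(hdbscan_model=cluster_model, verbose=True)
kmeans_topics, _ = kmeans_topic_model.fit_transform(skills_augmented, embeddings)
Tuning the underlying HDBSCAN model (for example min_cluster_size and min_samples) is another common way to change how many documents end up in -1, though the effect is worth verifying on your data.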