python nlp bert-language-model topic-modeling

BERTopic classifying over a quarter of documents into outlier topic -1


I am running BERTopic with default options:

import pandas as pd
from sentence_transformers import SentenceTransformer
import time
import pickle
from bertopic import BERTopic

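llm_mod = "all-MiniLM-L6-v2"
model = SentenceTransformer(llm_mod)

# skills_augmented holds the documents (one short sentence each)
embeddings = model.encode(skills_augmented, show_progress_bar=True)
bertopic_model = BERTopic(verbose=True)
topics, probs = bertopic_model.fit_transform(skills_augmented, embeddings)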

I have a dataset of 40,000 documents, each only one short sentence. 13,573 of them are placed in topic -1 (the counts for the five largest topics, including -1, are shown below).

-1      13573
 0       1593
 1       1043
 2        628
 3        627

From the documentation: "The -1 refers to all outliers and should typically be ignored." Is there a parameter I can use to get fewer documents in -1, or perhaps a more even distribution across topics? Would running k-means be better?


Solution

  • From the documentation:

    The main way to reduce your outliers in BERTopic is by using the .reduce_outliers function. To make it work without too much tweaking, you will only need to pass the docs and their corresponding topics. You can pass outlier and non-outlier documents together since it will only try to reduce outlier documents and label them to a non-outlier topic.

    The following is a minimal example:

    from bertopic import BERTopic
    
    # Train your BERTopic model
    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(docs)
    
    # Reduce outliers
    new_topics = topic_model.reduce_outliers(docs, topics)
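
As a rough sketch of how a specific strategy might be chosen: reduce_outliers accepts a strategy argument ("embeddings" and "c-tf-idf" are among the documented options), and update_topics can then refresh the topic representations to match the new assignments. The helper names outlier_share and reassign_outliers are hypothetical, and topic_model is assumed to be an already fitted BERTopic model.

```python
from collections import Counter


def outlier_share(topics):
    """Return the fraction of documents assigned to the outlier topic -1."""
    n = len(topics)
    return Counter(topics)[-1] / n if n else 0.0


def reassign_outliers(topic_model, docs, topics, embeddings=None):
    """Reassign outlier documents on an already fitted BERTopic model.

    Hypothetical helper: uses the "embeddings" strategy when precomputed
    embeddings are supplied, and falls back to "c-tf-idf" otherwise.
    """
    if embeddings is not None:
        new_topics = topic_model.reduce_outliers(
            docs, topics, strategy="embeddings", embeddings=embeddings
        )
    else:
        new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf")

    # Recompute the topic representations so they reflect the new assignments.
    topic_model.update_topics(docs, topics=new_topics)
    return new_topics
```

Comparing outlier_share(topics) with outlier_share(new_topics) gives a quick check of how many documents actually left topic -1.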
    

    You can find all the strategies for reducing outliers on the "Outlier reduction" page of the documentation.