Tags: python, cluster-analysis, topic-modeling

Topic modelling many documents with low memory overhead


I've been working on a topic modelling project using BERTopic 0.16.3, and the preliminary results were promising. However, as the project progressed and the requirements became apparent, I ran into a specific issue with scalability.

Specifically:

That last requirement necessitates processing the documents in batches, since loading them all into memory at once requires memory that grows linearly with the size of the corpus. So I've been looking into clustering algorithms that support online topic modelling. BERTopic's documentation suggests scikit-learn's MiniBatchKMeans, but the results I'm getting from it aren't very good.
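
For reference, this is roughly the batched setup I've been following, based on BERTopic's online topic modelling documentation (the chunk iterator and the parameter values below are placeholders, not my exact configuration):

    from bertopic import BERTopic
    from bertopic.vectorizers import OnlineCountVectorizer
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.decomposition import IncrementalPCA
    
    # Sub-models that support online learning via partial_fit
    umap_model = IncrementalPCA(n_components=5)
    cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
    vectorizer_model = OnlineCountVectorizer(stop_words="english")
    
    topic_model = BERTopic(umap_model=umap_model,
                           hdbscan_model=cluster_model,
                           vectorizer_model=vectorizer_model)
    
    # doc_chunks is a placeholder for an iterator over document batches
    for docs in doc_chunks:
        topic_model.partial_fit(docs)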

Some models I've looked at include:

The latter two also don't provide the predict method, limiting their utility.
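
To illustrate that constraint (with stand-in scikit-learn estimators, not the specific models from my list): BERTopic relies on the cluster model's predict method to assign topics to documents it wasn't fitted on, so a clusterer without one can only be used for the initial fit.

    from sklearn.cluster import MiniBatchKMeans, AgglomerativeClustering
    
    # MiniBatchKMeans can label unseen points; AgglomerativeClustering
    # cannot, so it can't assign topics to new documents via transform()
    print(hasattr(MiniBatchKMeans(), "predict"))          # True
    print(hasattr(AgglomerativeClustering(), "predict"))  # False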

I am fairly new to the subject, so perhaps I'm approaching this completely wrong and the immediate problem I'm trying to solve has no solution. So, to be clear, the base-level question I'm trying to answer is: how do I perform topic modelling (and get good results) on a large number of documents without using too much memory?


Solution

  • In general, advanced techniques like UMAP and HDBSCAN are helpful for producing high-quality results on larger datasets, but they will take more memory. Unless it's absolutely required, you may want to consider relaxing the memory constraint for the sake of performance, real-world human time, and actual cost (hourly instance or otherwise).

    At this scale, for a workflow you expect to go to production, rather than trying to work around this in software it may be easier to switch hardware. The GPU-accelerated UMAP and HDBSCAN in cuML can handle this much data very quickly -- quickly enough that it's probably worth considering renting a GPU-enabled system if you don't have one locally.

    For the following example, I took a sample of one million Amazon reviews, encoded them into embeddings (384 dimensions), and used the GPU UMAP and HDBSCAN in the current cuML release (v24.08). I ran this on a system with an H100 GPU.

    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer
    import pandas as pd
    from cuml.manifold.umap import UMAP
    from cuml.cluster import HDBSCAN
    
    df = pd.read_json("Electronics.json.gz", lines=True, nrows=1000000)
    reviews = df.reviewText.tolist()
    
    # Create embeddings
    sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = sentence_model.encode(reviews, batch_size=1024, show_progress_bar=True)
    
    # Reduce the 384-dimensional embeddings to 5 dimensions on the GPU
    reducer = UMAP(n_components=5)
    %time reduced_embeddings = reducer.fit_transform(embeddings)
    # CPU times: user 1min 33s, sys: 7.2 s, total: 1min 40s
    # Wall time: 7.31 s
    
    # Cluster the reduced embeddings with GPU HDBSCAN
    clusterer = HDBSCAN()
    %time clusterer.fit(reduced_embeddings)
    # CPU times: user 21.5 s, sys: 125 ms, total: 21.6 s
    # Wall time: 21.6 s
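    
    To complete the pipeline (and put the BERTopic import above to use), the GPU models can be passed directly to BERTopic, mirroring the snippet in the BERTopic FAQ; the parameters here are illustrative, not tuned:
    
    # Hand the GPU-accelerated models to BERTopic in place of the CPU defaults.
    # prediction_data=True lets HDBSCAN assign topics to unseen documents later.
    umap_model = UMAP(n_components=5)
    hdbscan_model = HDBSCAN(prediction_data=True)
    topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
    topics, probs = topic_model.fit_transform(reviews, embeddings)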
    

    There's an example of how to run these steps on GPUs in the BERTopic FAQs.

    I work on these projects at NVIDIA and am a community contributor to BERTopic, so if you run into any issues, please let me know or file a GitHub issue.