I've been working on a topic modelling project using BERTopic 0.16.3, and the preliminary results were promising. However, as the project progressed and the requirements became apparent, I ran into a specific issue with scalability.
Specifically, I need to handle a large number of documents while keeping memory usage bounded. That requirement necessitates batching the documents, since loading them all into memory at once makes memory use grow linearly with the size of the corpus. So I've been looking into clustering algorithms that work with online topic modelling. BERTopic's documentation suggests scikit-learn's MiniBatchKMeans, but the results I'm getting from it aren't very good.
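For reference, the kind of setup BERTopic's online topic modelling documentation describes looks roughly like this. This is a minimal sketch; the parameter values and the document_batches iterable are illustrative placeholders, not my actual configuration:

from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer

# Components that support partial_fit, so the model can be updated batch by batch
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=0.01)

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=cluster_model,  # any clusterer exposing partial_fit/predict can go here
    vectorizer_model=vectorizer_model,
)

# Feed the corpus in chunks instead of loading everything at once
for batch in document_batches:  # e.g. lists of a few thousand documents each
    topic_model.partial_fit(batch)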
Some models I've looked at include:

- Birch (via scikit-learn): uses even more memory than BERTopic's default HDBSCAN, even when batched. It also runs much slower.
- IncrementalDBSCAN (via incdbscan): seemed promising at first, but the runtime and eventually the memory ballooned. For ~120k documents in batches of 5000, it used no more than 4GB of RAM in the first 3½ hours, but it didn't finish within ten hours and used nearly 40GB of RAM at some point in the middle.
- AgglomerativeClustering (via scikit-learn): gave very good results in initial testing (perhaps even better than HDBSCAN), but it doesn't implement the partial_fit method. I found an answer to a different question suggesting it's possible to train two of them independently using single linkage and then merge them, but it gives no indication as to how.

The latter two also don't provide the predict method, which limits their utility.
I am fairly new to the subject, so perhaps I'm approaching this completely wrong and the immediate problem I'm trying to solve has no solution. To be clear, the underlying question I'm trying to answer is: how do I perform topic modelling (and get good results) on a large number of documents without using too much memory?
In general, advanced techniques like UMAP and HDBSCAN help produce high-quality results on larger datasets, but they also use more memory. Unless the memory limit is a hard requirement, you may want to consider relaxing it for the sake of performance, real-world human time, and actual cost (hourly instance pricing or otherwise).
At this scale, for a workflow you expect to take to production, it may be easier to switch hardware than to work around the problem in software. The GPU-accelerated UMAP and HDBSCAN in cuML can handle this much data very quickly, quickly enough that renting a GPU-enabled system is probably worth considering if you don't have one locally.
For the following example, I took a sample of one million Amazon reviews, encoded them into embeddings (384 dimensions), and used the GPU UMAP and HDBSCAN in the current cuML release (v24.08). I ran this on a system with an H100 GPU.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
import pandas as pd
from cuml.manifold.umap import UMAP
from cuml.cluster import HDBSCAN
df = pd.read_json("Electronics.json.gz", lines=True, nrows=1000000)
reviews = df.reviewText.tolist()
# Create embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(reviews, batch_size=1024, show_progress_bar=True)
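# Reduce the 384-dimensional embeddings to 5 dimensions on the GPU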
reducer = UMAP(n_components=5)
%time reduced_embeddings = reducer.fit_transform(embeddings)
CPU times: user 1min 33s, sys: 7.2 s, total: 1min 40s
Wall time: 7.31 s
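# Cluster the reduced embeddings on the GPU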
clusterer = HDBSCAN()
%time clusterer.fit(reduced_embeddings)
CPU times: user 21.5 s, sys: 125 ms, total: 21.6 s
Wall time: 21.6 s
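To go from clusters to topics, those same cuML models can be handed to BERTopic in place of the default CPU UMAP and HDBSCAN. A minimal sketch, continuing the session above and reusing the reviews and embeddings already computed (parameters are illustrative):

# cuML's UMAP and HDBSCAN act as drop-in replacements for the CPU versions inside BERTopic
umap_model = UMAP(n_components=5)
hdbscan_model = HDBSCAN()
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)

# Pass the precomputed embeddings so BERTopic skips re-encoding the documents
topics, probs = topic_model.fit_transform(reviews, embeddings)

# Inspect the discovered topics
topic_model.get_topic_info().head()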
There's an example of how to run these steps on GPUs in the BERTopic FAQs.
I work on these projects at NVIDIA and am a community contributor to BERTopic, so if you run into any issues, please let me know and file a GitHub issue.