Say I build a BERTopic model using
from bertopic import BERTopic
topic_model = BERTopic(n_gram_range=(1, 1), nr_topics=20)
topics, probs = topic_model.fit_transform(docs)
Inspecting probs gives me just a single value for each item in docs.
probs
array([0.51914467, 0.        , 0.        , ..., 1.        , 1.        ,
       1.        ])
I would like the entire probability vector across all topics (so in this case, where nr_topics=20, I want a vector of 20 probabilities for each item in docs). In other words, if I have N items in docs and K topics, I would like an NxK output.
To get the probability of every topic for each document, you need to pass one more argument when creating the model:
topic_model = BERTopic(n_gram_range=(1, 1), nr_topics=20, calculate_probabilities=True)
Note: calculate_probabilities=True only works when HDBSCAN is the clustering model, which is what BERTopic uses by default. (all-MiniLM-L6-v2 is BERTopic's default embedding model, which is a separate component from the clustering step.)
Official documentation: https://maartengr.github.io/BERTopic/api/bertopic.html
The documentation mentions this as well.
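To make the NxK output concrete, here is a minimal sketch of the full flow. The names fit_with_full_probabilities and top_topics are hypothetical helpers I made up for illustration, not part of BERTopic. I am also assuming that column i of probs corresponds to topic i and that the outlier topic -1 gets no column, so K may come out smaller than nr_topics; it is worth checking probs.shape against topic_model.get_topic_info() on your version.

```python
def fit_with_full_probabilities(docs, nr_topics=20):
    """Fit BERTopic so that probs is a 2-D document-by-topic array."""
    # Imported here so the helper below stays usable without BERTopic installed.
    from bertopic import BERTopic

    topic_model = BERTopic(
        n_gram_range=(1, 1),
        nr_topics=nr_topics,
        calculate_probabilities=True,  # request the full topic distribution
    )
    topics, probs = topic_model.fit_transform(docs)
    return topic_model, topics, probs


def top_topics(doc_probs, k=3):
    """Hypothetical helper: the k most probable (column index, probability)
    pairs from one row of probs."""
    ranked = sorted(enumerate(doc_probs), key=lambda pair: pair[1], reverse=True)
    return [(index, float(p)) for index, p in ranked[:k]]


# Usage, assuming docs is a list of strings:
# topic_model, topics, probs = fit_with_full_probabilities(docs)
# print(probs.shape)           # (N, K); read K from here rather than assuming nr_topics
# print(top_topics(probs[0]))  # most probable topics for the first document
```

Keep in mind that computing the full matrix can noticeably slow down fitting on large collections, so it is only worth turning on when you actually need the per-topic values.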