I'm still fairly new to Python, so this might be easier than it appears to me, but I'm stuck. I'm trying to use BERTopic and visualize the results with pyLDAvis, so that I can compare them with the results I got using LDA.
This is my code, where "data_words" is the same object that I previously used for LDA topic modeling:
import pyLDAvis
import numpy as np
from bertopic import BERTopic
# Train Model
bert_model = BERTopic(verbose=True, calculate_probabilities=True)
topics, probs = bert_model.fit_transform(data_words)
# Prepare data for PyLDAVis
top_n = 5
topic_term_dists = bert_model.c_tf_idf.toarray()[:top_n+1, ]
new_probs = probs[:, :top_n]
outlier = np.array(1 - new_probs.sum(axis=1)).reshape(-1, 1)
doc_topic_dists = np.hstack((new_probs, outlier))
doc_lengths = [len(doc) for doc in data_words]
vocab = [word for word in bert_model.vectorizer_model.vocabulary_.keys()]
term_frequency = [bert_model.vectorizer_model.vocabulary_[word] for word in vocab]
data = {'topic_term_dists': topic_term_dists,
'doc_topic_dists': doc_topic_dists,
'doc_lengths': doc_lengths,
'vocab': vocab,
'term_frequency': term_frequency}
# Visualize using pyLDAvis
vis_data= pyLDAvis.prepare(**data, mds='mmds')
pyLDAvis.display(vis_data)
I keep getting the following error and I don't understand how to fix the problem:
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[9], line 4
1 from bertopic import BERTopic
3 bert_model = BERTopic()
----> 4 topics, probs = bert_model.fit_transform(data_words)
6 bert_model.get_topic_freq()
File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/bertopic/_bertopic.py:373, in BERTopic.fit_transform(self, documents, embeddings, images, y)
325 """ Fit the models on a collection of documents, generate topics,
326 and return the probabilities and topic per document.
327
(...)
370 ```
371 """
372 if documents is not None:
--> 373 check_documents_type(documents)
374 check_embeddings_shape(embeddings, documents)
376 doc_ids = range(len(documents)) if documents is not None else range(len(images))
File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/bertopic/_utils.py:43, in check_documents_type(documents)
41 elif isinstance(documents, Iterable) and not isinstance(documents, str):
42 if not any([isinstance(doc, str) for doc in documents]):
---> 43 raise TypeError("Make sure that the iterable only contains strings.")
44 else:
45 raise TypeError("Make sure that the documents variable is an iterable containing strings only.")
TypeError: Make sure that the iterable only contains strings.
Edit: I assume that the data I'm trying to analyze aren't formatted the way BERTopic expects. My dataset is structured like this:
{
    "TFU_1881_00102": {
        "magazine": "edited out",
        "country": "United Kingdom",
        "year": "1881",
        "tokens": [
            "word1",
            "word2"
        ],
        "bigramFreqs": {
            "word1 word2": 1
        },
        "tokenFreqs": {
            "word1": 1,
            "word2": 1
        }
    },
    "TFU_1881_00103": {
        "magazine": "edited out",
        "country": "United Kingdom",
        "year": "1881",
        "tokens": [
            "word3",
            "word4"
        ],
        "bigramFreqs": {
            "word3 word4": 1
        },
        "tokenFreqs": {
            "word3": 1,
            "word4": 1
        }
    }
}
I then create the "data_words" object with this code:
import json

with open("Data/5_json/output_final.json", "r") as file:
    data = json.load(file)

data_words = []
counter = 0
for key in data:
    counter += 1
    sub_list = data[key]["tokens"]  # list of token strings for this document
    data_words.append(sub_list)
print(counter)  # number of documents loaded
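For reference, the same extraction can be tried without the file on disk. The following is only a sketch: it parses an inline JSON string (the name sample_json is an assumption for illustration) that keeps just two fields from the sample records shown above.

```python
import json

# Inline stand-in for the JSON file, reduced to the fields used here.
sample_json = """
{
  "TFU_1881_00102": {"year": "1881", "tokens": ["word1", "word2"]},
  "TFU_1881_00103": {"year": "1881", "tokens": ["word3", "word4"]}
}
"""

data = json.loads(sample_json)

# Collect the token list of every record, in file order.
data_words = [record["tokens"] for record in data.values()]

print(len(data_words))
print(data_words)
```

Each element of data_words is itself a list, which is the nested shape that the type check in the traceback rejects.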
Edit2: What worked. After Goku suggested flattening my list of lists, I initially tried this solution:
flat_data_words = []
for sub_list in data_words:
    for token in sub_list:
        flat_data_words.append(token)
It got rid of the TypeError, but the code then produced a new error. After searching a bit more, I found a similar thread that made me realize BERTopic expects each string in the list to be a whole document. That wasn't the case here, because flattening my list of lists simply produced a list of single tokens, which I think is why I was getting the new error. Then I tried joining the tokens of each document into one string, and now it seems to work:
flat_data_words = []
for list_of_strings in data_words:
    sentence = ' '.join(list_of_strings)
    flat_data_words.append(sentence)
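The joining step can be sketched on the two sample records from the dataset excerpt. The helper name tokens_to_documents is hypothetical and not part of BERTopic; it only illustrates turning one token list per document into one string per document.

```python
from typing import List


def tokens_to_documents(token_lists: List[List[str]]) -> List[str]:
    """Join each document's tokens into one whitespace-separated string."""
    return [" ".join(tokens) for tokens in token_lists]


# Two tokenized documents, mirroring the sample records in the question.
data_words = [["word1", "word2"], ["word3", "word4"]]
documents = tokens_to_documents(data_words)

# One string per document, so the number of documents is unchanged.
print(documents)
```

The resulting list of strings is what would then be passed to fit_transform in place of the nested list.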
data_words is a nested list: it contains lists of strings rather than strings.
bert_model.fit_transform(data_words), like .fit(), expects an iterable that contains only strings.
You can try flattening data_words so that it contains only strings, and then call:
bert_model.fit_transform(data_words)
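A quick way to confirm which shape you have before fitting is to look at the types of the top-level elements. This is an illustrative sketch only; element_types is a hypothetical helper, not a BERTopic function, and the sample values are taken from the question.

```python
from collections import Counter
from typing import Any, Iterable


def element_types(items: Iterable[Any]) -> Counter:
    """Count the type names of the top-level elements of an iterable."""
    return Counter(type(item).__name__ for item in items)


nested = [["word1", "word2"], ["word3", "word4"]]  # list of token lists
joined = [" ".join(tokens) for tokens in nested]    # list of strings

nested_types = element_types(nested)
joined_types = element_types(joined)

print(nested_types)  # only "list" entries: the shape that raises the TypeError
print(joined_types)  # only "str" entries: one string per document
```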
A related issue: https://github.com/meghutch/tracking_pasc/blob/main/BERTopic%20Preprocessing%20Test%20using%20120%2C000%20test%20tweets.ipynb