python, python-3.x, nlp, topic-modeling

BERTopic: "Make sure that the iterable only contains strings"


I'm still fairly new to Python, so this might be easier than it appears to me, but I'm stuck. I'm trying to use BERTopic and visualize the results with pyLDAvis so that I can compare them with the ones I got using LDA.

This is my code, where "data_words" is the same object that I previously used with LDA Topic Modeling:

import pyLDAvis
import numpy as np
from bertopic import BERTopic

# Train Model
bert_model = BERTopic(verbose=True, calculate_probabilities=True)
topics, probs = bert_model.fit_transform(data_words)

# Prepare data for PyLDAVis
top_n = 5

topic_term_dists = bert_model.c_tf_idf_.toarray()[:top_n+1, ]
new_probs = probs[:, :top_n]
outlier = np.array(1 - new_probs.sum(axis=1)).reshape(-1, 1)
doc_topic_dists = np.hstack((new_probs, outlier))
doc_lengths = [len(doc) for doc in data_words]
vocab = [word for word in bert_model.vectorizer_model.vocabulary_.keys()]
term_frequency = [bert_model.vectorizer_model.vocabulary_[word] for word in vocab]

data = {'topic_term_dists': topic_term_dists,
        'doc_topic_dists': doc_topic_dists,
        'doc_lengths': doc_lengths,
        'vocab': vocab,
        'term_frequency': term_frequency}

# Visualize using pyLDAvis
vis_data = pyLDAvis.prepare(**data, mds='mmds')
pyLDAvis.display(vis_data)

I keep getting the following error and I don't understand how to fix the problem:

/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[9], line 4
      1 from bertopic import BERTopic
      3 bert_model = BERTopic()
----> 4 topics, probs = bert_model.fit_transform(data_words)
      6 bert_model.get_topic_freq()

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/bertopic/_bertopic.py:373, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    325 """ Fit the models on a collection of documents, generate topics,
    326 and return the probabilities and topic per document.
    327 
   (...)
    370 ```
    371 """
    372 if documents is not None:
--> 373     check_documents_type(documents)
    374     check_embeddings_shape(embeddings, documents)
    376 doc_ids = range(len(documents)) if documents is not None else range(len(images))

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/bertopic/_utils.py:43, in check_documents_type(documents)
     41 elif isinstance(documents, Iterable) and not isinstance(documents, str):
     42     if not any([isinstance(doc, str) for doc in documents]):
---> 43         raise TypeError("Make sure that the iterable only contains strings.")
     44 else:
     45     raise TypeError("Make sure that the documents variable is an iterable containing strings only.")

TypeError: Make sure that the iterable only contains strings.

Edit: I'm assuming that the data I'm trying to analyze aren't formatted the way BERTopic expects them to be. My dataset is structured like this:

{
    "TFU_1881_00102": {
        "magazine": "edited out",
        "country": "United Kingdom",
        "year": "1881",
        "tokens": [
            "word1",
            "word2"
        ],
        "bigramFreqs": {
            "word1 word2": 1
        },
        "tokenFreqs": {
            "word1": 1,
            "word2": 1
        }
    },
    "TFU_1881_00103": {
        "magazine": "edited out",
        "country": "United Kingdom",
        "year": "1881",
        "tokens": [
            "word3",
            "word4"
        ],
        "bigramFreqs": {
            "word3 word4": 1
        },
        "tokenFreqs": {
            "word3": 1,
            "word4": 1
        }
    }
}

I then create the "data_words" object with this code:

import json

with open("Data/5_json/output_final.json", "r") as file:
    data = json.load(file)

# Collect each document's token list (one list of strings per document)
data_words = []
for key in data:
    data_words.append(data[key]["tokens"])
print(len(data_words))  # number of documents
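To see what BERTopic is actually being handed, the top-level element types of data_words can be counted. This is only a minimal sketch: summarize_element_types is a hypothetical helper, and the two-item data_words below is a stand-in that mirrors the JSON sample above rather than the real dataset.

```python
from collections import Counter


def summarize_element_types(items):
    """Count the type names of the top-level elements of an iterable."""
    return Counter(type(item).__name__ for item in items)


# Hypothetical stand-in for data_words, mirroring the JSON sample above
data_words = [["word1", "word2"], ["word3", "word4"]]

# Every element is a list, and none is a str, which is what the check rejects
print(summarize_element_types(data_words))  # Counter({'list': 2})
```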

Edit 2 (what worked): After Goku suggested flattening my list of lists, I initially tried this:

flat_data_words = []

# Avoid naming the loop variable "list", which would shadow the built-in
for token_list in data_words:
    for token in token_list:
        flat_data_words.append(token)

The flattening itself ran, but fit_transform then failed with a new error. After searching a bit more, I found a similar question that made me understand that BERTopic expects each string in the list to be a whole document. That wasn't my case: the code I used to flatten my list of lists produces a list of single tokens, which I think is why I was getting the new error. Then I tried this, and now it seems to work:

flat_data_words = []

for list_of_strings in data_words:
    sentence = ' '.join(list_of_strings)
    flat_data_words.append(sentence)
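The difference between the two attempts can be seen side by side on a small input. This is a sketch only: the data_words below is a hypothetical two-document sample, and tokens_only and documents are illustrative names.

```python
from itertools import chain

# Hypothetical sample: two documents, each already tokenized
data_words = [["word1", "word2"], ["word3", "word4"]]

# Flattening: one string per token, so document boundaries are lost
tokens_only = list(chain.from_iterable(data_words))

# Joining: one string per document, so boundaries are preserved
documents = [" ".join(tokens) for tokens in data_words]

print(tokens_only)  # ['word1', 'word2', 'word3', 'word4']
print(documents)    # ['word1 word2', 'word3 word4']
```

Both results are lists of strings, so both pass the type check, but only the second keeps one entry per document.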

Solution

  • data_words is a nested list.

    Its elements are lists of strings, not strings, so this call fails:

    bert_model.fit_transform(data_words)

    .fit_transform() expects an iterable that contains only strings.

    You can try flattening data_words so that it contains only strings, and then call:

    bert_model.fit_transform(data_words)
    

    A related issue: https://github.com/meghutch/tracking_pasc/blob/main/BERTopic%20Preprocessing%20Test%20using%20120%2C000%20test%20tweets.ipynb
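
Putting the pieces from the question together, the preparation step might look like the sketch below. It assumes the JSON layout shown in the question; build_documents is a hypothetical helper, and skipping records with an empty "tokens" list is an assumption rather than something BERTopic requires. The fit_transform call is left as a comment, because a two-record sample is likely too small to fit a model on.

```python
def build_documents(data):
    """Return (doc_ids, documents) with one space-joined string per record.

    `data` is assumed to be the parsed JSON dict from the question:
    {doc_id: {"tokens": [...], ...}, ...}.
    """
    doc_ids, documents = [], []
    for doc_id, record in data.items():
        tokens = record.get("tokens", [])
        if not tokens:
            continue  # assumption: records without tokens are skipped
        doc_ids.append(doc_id)
        documents.append(" ".join(tokens))
    return doc_ids, documents


# Hypothetical sample mirroring the structure shown in the question
sample = {
    "TFU_1881_00102": {"tokens": ["word1", "word2"]},
    "TFU_1881_00103": {"tokens": ["word3", "word4"]},
}

doc_ids, documents = build_documents(sample)

# Stricter than necessary, but it catches nested lists before fitting
assert all(isinstance(doc, str) for doc in documents)

# With the full dataset, the model would then be fitted on the strings:
# topics, probs = bert_model.fit_transform(documents)
```

Keeping doc_ids aligned with documents makes it possible to map each returned topic back to its record key afterwards.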