So I am trying to run LDA mallet on a dataset. It takes in lemma tokens and a bunch of texts which is our dataset. The issue is when we run, a freeze message pops up and all of our old methods that have already ran start running again. It says its due to the multiprocessing starting before the other finished. Not sure how to fix. This is ran on MacOS. Code and output are below.
import gensim
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
import os.path
def optimize_parameters(lemma_tokens, texts):
os.environ['MALLET_HOME'] = '****/mallet-2.0.8'
mallet_path = '****/mallet-2.0.8/bin/mallet'
id2word = Dictionary(lemma_tokens)
# Filtering Extremes
id2word.filter_extremes(no_below=2, no_above=.99)
# Creating a corpus object
corpus = [id2word.doc2bow(d) for d in lemma_tokens]
model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=5, id2word=id2word, workers = 4)
coherencemodel = CoherenceModel(model=model, texts=lemma_tokens, dictionary=id2word, coherence='c_v')
coherence = coherencemodel.get_coherence()
The "****" is the rest of the path that can't be shown due to privacy.
The error output:
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
<10> LL/token: -6.83952
<20> LL/token: -6.70949
I figure it out. You have to put the entire script in
if __name__ == '__main__':
imports
code
Found solution via an old google chat. Posted link: https://groups.google.com/g/gensim/c/-gMNdkujR48/m/i4Dn1_bjBQAJ
To summarize what is happening, due to multiprocessing, the other bits of code are run multiple times instead of the once it is supposed to. This is the same case for the actual function itself which runs the same call multiple times. The fix of the if statement checks to see if this is the first run through. If it is, then we do the entire call. If not, we don't run anything at all. This works because it makes sure that we are only running it once.