python multiprocessing lda mallet

LDA Mallet Multiprocessing Freezing


I am trying to run LDA Mallet (through gensim) on a dataset. The function takes in lemma tokens and a collection of texts, which is our dataset. The issue is that when we run it, a freeze_support error message appears and all of the earlier code that has already run starts running again. The message says this is because a new process was started before the current one had finished bootstrapping. I'm not sure how to fix it. This is run on macOS. Code and output are below.

import gensim
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
import os.path

def optimize_parameters(lemma_tokens, texts):
    
    os.environ['MALLET_HOME'] = '****/mallet-2.0.8'
    mallet_path = '****/mallet-2.0.8/bin/mallet'

    id2word = Dictionary(lemma_tokens)

    # Filtering Extremes
    id2word.filter_extremes(no_below=2, no_above=.99)

    # Creating a corpus object 
    corpus = [id2word.doc2bow(d) for d in lemma_tokens]

    model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=5, id2word=id2word, workers = 4)
    coherencemodel = CoherenceModel(model=model, texts=lemma_tokens, dictionary=id2word, coherence='c_v')
    coherence = coherencemodel.get_coherence()

The "****" stands for the rest of the path, which is omitted for privacy.

The error output:

RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
<10> LL/token: -6.83952
<20> LL/token: -6.70949

Solution

  • I figured it out. You have to put the entire script inside

    if __name__ == '__main__':
      imports
      code
    

    I found the solution in an old Google Groups thread for gensim: https://groups.google.com/g/gensim/c/-gMNdkujR48/m/i4Dn1_bjBQAJ

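    To summarize what is happening: as the error message says, the worker processes are not started with fork (spawn is the default start method on macOS since Python 3.8). Each worker is a fresh Python process that re-imports the main script, so any code at module level runs again in every worker instead of just once. That includes the call to the function itself, which then tries to start workers of its own, and that is what raises the RuntimeError. The if statement checks whether the script is running as the main program. That is true only in the process you launched yourself. The workers load the script under a different module name, so they skip everything under the check, and the guarded code runs exactly once.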
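
The effect of the guard can be checked without gensim or Mallet. The sketch below is only an illustration: it imitates what a spawned worker does at startup, which is to re-run the main script under the module name "__mp_main__", by using runpy on a throwaway file instead of starting real worker processes. The script contents and the names SCRIPT, run_script_as and demo_script.py are made up for this example.

```python
import contextlib
import io
import os
import runpy
import tempfile

# Hypothetical stand-in for the user's script: one unguarded line and one
# guarded line, so we can see which of them runs in each situation.
SCRIPT = '''
print("module-level code ran")

if __name__ == "__main__":
    print("guarded code ran")
'''


def run_script_as(path, run_name):
    """Execute the script at `path` under `run_name`; return its printed lines."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        runpy.run_path(path, run_name=run_name)
    return buffer.getvalue().splitlines()


with tempfile.TemporaryDirectory() as tmp_dir:
    script_path = os.path.join(tmp_dir, "demo_script.py")
    with open(script_path, "w") as handle:
        handle.write(SCRIPT)

    # Launching the script yourself: the module name is "__main__".
    as_parent = run_script_as(script_path, "__main__")
    # Imitating a spawned worker: the same file is re-run as "__mp_main__".
    as_worker = run_script_as(script_path, "__mp_main__")

print("parent:", as_parent)  # both lines ran
print("worker:", as_worker)  # only the module-level line ran
```

In the question's script, the call to optimize_parameters(...) plays the role of the guarded line. Imports and function definitions are harmless to re-run, so it should be enough to move only the code that actually starts the work under the guard, although putting everything there, as described above, works too.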