python performance optimization nlp nltk

How to optimize this function and improve running time?


I have a function that creates a DataFrame with three columns: the bigram phrase, its count, and its PMI score. I want to run it on a large dataset with over a million phrases, and the compute time is extremely long. I recognize that the nested for loops and the matching condition are the bottleneck. Is there an alternative way to do the same thing with a shorter run time?

Here's my code:

import pandas as pd

def pmi_count_phrase_create(pmi_tups, freq_list):
    """Build a DataFrame with one row per bigram: phrase, PMI, count.

    pmi_tups is the result of running
        pmi_tups = [i for i in finder.score_ngrams(bigram_measures.pmi)]
    freq_list is the result of running
        freq_list = finder.ngram_fd.items()
    """
    pmi3_list = []
    count3_list = []
    phrase3_list = []
    # pmi_tups is a list of tuples of the form [((phrase), pmi), ...]
    for phrase, pmi in pmi_tups:
        # Every phrase triggers a full scan of freq_list, so the total
        # work grows as len(pmi_tups) * len(freq_list).
        for bigram, count in freq_list:
            if bigram == phrase:
                pmi3_list.append(pmi)
                count3_list.append(count)
                phrase3_list.append(phrase)

    # create dataframe
    df = pd.DataFrame({'Phrase': phrase3_list, 'PMI': pmi3_list, 'Count': count3_list})
    return df

When I run this code on my pmi_tups and freq_list, it is still running after more than 1000 minutes. I'm also open to using a different library to compute the bigram phrases, PMI scores, and frequencies.


Solution

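  • I ended up converting freq_list to a dictionary and replacing the nested for loops with list comprehensions. A dictionary lookup takes constant time on average, so each phrase no longer needs a scan of the whole frequency list, and this code returned the DataFrame instantly: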

    def quicker_func(pmi_tups, freq_list):
        import pandas as pd
        freq_dict = dict(freq_list)  # Create a dictionary for faster lookups 
    
        pmi_list = [pmi for phrase, pmi in pmi_tups if phrase in freq_dict]
        count_list = [freq_dict[phrase] for phrase, pmi in pmi_tups if phrase in freq_dict]
        phrase_list = [phrase for phrase, pmi in pmi_tups if phrase in freq_dict]
    
        df = pd.DataFrame({'Phrase': phrase_list, 'PMI': pmi_list, 'Count': count_list})
        return df
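
The three comprehensions above each walk pmi_tups and test membership separately. As a possible refinement, here is a minimal sketch that builds the rows in a single pass. The function name join_pmi_counts and the toy data are made up for illustration, and it assumes each phrase appears at most once in freq_list (which should hold for the items of a frequency distribution):

```python
def join_pmi_counts(pmi_tups, freq_list):
    """Return (phrase, pmi, count) rows for phrases present in both inputs.

    Hypothetical helper: one dictionary lookup per phrase, one pass over pmi_tups.
    """
    freq_dict = dict(freq_list)  # phrase -> count, O(1) average lookup
    return [
        (phrase, pmi, freq_dict[phrase])
        for phrase, pmi in pmi_tups
        if phrase in freq_dict
    ]


# Toy data in the same shape as the real inputs (values are invented).
pmi_tups = [(("new", "york"), 5.2), (("of", "the"), 0.3), (("not", "counted"), 1.0)]
freq_list = [(("of", "the"), 120), (("new", "york"), 7)]

rows = join_pmi_counts(pmi_tups, freq_list)
```

The rows can then be handed to pandas with pd.DataFrame(rows, columns=['Phrase', 'PMI', 'Count']), which should give the same columns as quicker_func. The row order follows pmi_tups, just as in the versions above.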