I have a function aimed at creating a DataFrame with three columns: bigram phrase, count (of the bigram phrase), and PMI score (for the bigram phrase). Since I want to run this on a large dataset with over a million phrases, the compute time is incredibly long. I recognize that the nested for loops and the equality check inside them are what make it slow. Is there an alternative way to do the same thing and cut down the run time?
Here's my code:
def pmi_count_phrase_create(pmi_tups, freq_list):
    """pmi_tups is the result of running pmi_tups = [i for i in finder.score_ngrams(bigram_measures.pmi)]
    freq_list is the result of running freq_list = finder.ngram_fd.items()
    -> df made up of columns for phrase list, PMI list, count list"""
    import pandas as pd
    pmi3_list = []
    count3_list = []
    phrase3_list = []
    for phrase, pmi in pmi_tups:  # pmi_tups is a list of tuples of the form [((phrase), pmi), ...]
        for item in freq_list:  # linear scan over all of freq_list for every phrase
            bigram, count = item
            if bigram == phrase:
                pmi3_list.append(pmi)
                count3_list.append(count)
                phrase3_list.append(phrase)
    # create dataframe
    df = pd.DataFrame({'Phrase': phrase3_list, 'PMI': pmi3_list, 'Count': count3_list})
    return df
When I run this code on my pmi_tups and freq_list, it is still going after more than 1,000 minutes. I'm also open to using a different library to evaluate the bigram phrases, PMIs, and frequencies.
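For reference, the two inputs come from NLTK's collocation tools, as the docstring above notes. Here is a minimal sketch of that setup, assuming a token list named tokens (a small placeholder, not my real corpus):

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = ['this', 'is', 'a', 'small', 'example', 'token', 'list']  # placeholder corpus

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

pmi_tups = finder.score_ngrams(bigram_measures.pmi)  # [((w1, w2), pmi), ...]
freq_list = finder.ngram_fd.items()                  # [((w1, w2), count), ...]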
I ended up changing my function to convert freq_list to a dictionary, which turns each lookup into a constant-time operation instead of a linear scan, and to use list comprehensions instead of nested for loops. This version returned a DataFrame almost instantly:
def quicker_func(pmi_tups, freq_list):
    import pandas as pd
    freq_dict = dict(freq_list)  # create a dictionary for O(1) lookups
    pmi_list = [pmi for phrase, pmi in pmi_tups if phrase in freq_dict]
    count_list = [freq_dict[phrase] for phrase, pmi in pmi_tups if phrase in freq_dict]
    phrase_list = [phrase for phrase, pmi in pmi_tups if phrase in freq_dict]
    df = pd.DataFrame({'Phrase': phrase_list, 'PMI': pmi_list, 'Count': count_list})
    return df