I have a dataset of pre-processed online reviews; each row contains the words from one review. I am running Latent Dirichlet Allocation (LDA) to extract topics from the entire dataframe. Now I want to assign topics to each row using the LDA model method get_document_topics.
I found some code, but it only prints the probability of a single document being assigned to each topic. I'm trying to apply it to every document and write the results back to the same dataset. Here's the code I found...
text = ["user"]
bow = dictionary.doc2bow(text)
print("get_document_topics", model.get_document_topics(bow))
### get_document_topics [(0, 0.74568415806946331), (1, 0.25431584193053675)]
Here's what I'm trying to get...
                  stemming  probOnTopic1  probOnTopic2  probOnTopic3  topic
0      [bank, water, bank]           0.7           0.3           0.0      0
1  [baseball, rain, track]           0.1           0.8           0.1      1
2     [coin, money, money]           0.9           0.0           0.1      0
3      [vote, elect, bank]           0.2           0.0           0.8      2
Here's the code I'm working on...
def bow(text):
    return [dictionary.doc2bow(text) in document]
df["probability"] = optimal_model.get_document_topics(bow)
df[['probOnTopic1', 'probOnTopic2', 'probOnTopic3']] = pd.DataFrame(df['probability'].tolist(), index=df.index)
A slightly different approach, @Christabel, that also covers your other request for a 0.7 threshold:
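import pandas as pd

results = []
# Iterate over each review (a list of tokens)
for review in df['review']:
    bow = dictionary.doc2bow(review)
    # minimum_probability=0.0 asks gensim not to drop low-probability topics
    topics = model.get_document_topics(bow, minimum_probability=0.0)
    # Convert the (topic_id, probability) pairs to a dictionary
    topic_dict = {topic_id: prob for topic_id, prob in topics}
    # Find the most probable topic
    max_topic = max(topic_dict, key=topic_dict.get)
    # Keep it only if it clears the 0.7 threshold; otherwise fall back to 0
    if topic_dict[max_topic] > 0.7:
        topic = max_topic
    else:
        topic = 0
    topic_dict['topic'] = topic
    results.append(topic_dict)
# Build a DataFrame aligned with df's index, then merge on the index
df_topics = pd.DataFrame(results, index=df.index)
df = df.merge(df_topics, left_index=True, right_index=True)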
Does this work for you? You can then wrap this code in a function and pass the 0.7 threshold as a parameter, so it can be reused in different use cases.
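As a rough sketch of that refactoring, the per-document step could look something like the function below. The name topic_row and the parameters num_topics and fallback are my own placeholders, not part of gensim. It only assumes the input is a list of (topic_id, probability) pairs, which is what get_document_topics returns in the question.

```python
from typing import Dict, List, Optional, Tuple


def topic_row(
    doc_topics: List[Tuple[int, float]],
    num_topics: int,
    threshold: float = 0.7,
    fallback: Optional[int] = None,
) -> Dict[str, object]:
    """Turn one get_document_topics() result into a flat row of columns.

    topic_row, num_topics and fallback are hypothetical names for this sketch.
    """
    # Topics missing from the input (filtered out by the model) count as 0.0.
    probs = [0.0] * num_topics
    for topic_id, prob in doc_topics:
        probs[topic_id] = float(prob)

    # Column names follow the layout in the question (1-based).
    row: Dict[str, object] = {
        f"probOnTopic{i + 1}": p for i, p in enumerate(probs)
    }

    # Keep the most probable topic only if it clears the threshold.
    best = max(range(num_topics), key=probs.__getitem__)
    row["topic"] = best if probs[best] > threshold else fallback
    return row


# The sample output from the question, scored with two different thresholds.
sample = [(0, 0.74568415806946331), (1, 0.25431584193053675)]
print(topic_row(sample, num_topics=3))
print(topic_row(sample, num_topics=3, threshold=0.8))
```

To use it on the dataframe, you would build one row per review, for example by calling topic_row(model.get_document_topics(dictionary.doc2bow(review)), num_topics) inside the loop, and then pass the list of rows to pd.DataFrame with index=df.index before merging. The fallback defaults to None rather than 0 so that "no topic above the threshold" is not confused with topic 0; pass fallback=0 if you want the same behaviour as the code above.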