I have a corpus that I ran LDA on using gensim, and I'm trying to get a matrix in which rows are documents and columns are topics. I used the line of code below, but in the output the scores don't line up with the columns. I want column 0 to contain only the probability of topic 0, and likewise for columns 1, 2, etc.
Does anyone know how to do this?
DocTopMat = pd.DataFrame(model.get_document_topics(corpus),columns=[i for i in range(model.num_topics)])
I'm assuming the data you currently have is in (topic, probability) tuples, so try this to get the probabilities into their respective columns:
import pandas as pd
import numpy as np
df = pd.DataFrame({0:[(1,0.22),(0,0.08),(1,0.34),(1,0.87),(0,0.37)],
1:[(2,0.78),(1,0.92),(2,0.66),(3,0.13),(2,0.34)],
2:[np.nan,np.nan,np.nan,np.nan,(3,0.28)],
3:[np.nan,np.nan,np.nan,np.nan,np.nan],
4:[np.nan,np.nan,np.nan,np.nan,(4,0.01)]})
# Sum the probabilities of the tuples in a row that belong to the given topic.
# NaN cells are not tuples, so they are skipped; a topic missing from a row comes out as 0.
def topic_prob(row, topic):
    return sum(cell[1] for cell in row if isinstance(cell, tuple) and cell[0] == topic)
tuple_cols = [0, 1, 2, 3, 4]
for topic in range(5):
    df["topic_%d" % topic] = df[tuple_cols].apply(topic_prob, axis=1, topic=topic)
Output of df;
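As an alternative, here is a minimal sketch that builds the document-by-topic matrix directly from the list of (topic, probability) tuples, without first loading the tuples into a DataFrame. The name doc_topics and the hard-coded values are assumptions for illustration (they mirror the sample data above). With a real model, I'd expect the list to come from something like list(model.get_document_topics(corpus, minimum_probability=0)), though I haven't checked that against your gensim version.

```python
import pandas as pd

# Hypothetical stand-in for the per-document output of get_document_topics:
# one list of (topic_id, probability) tuples per document.
doc_topics = [
    [(1, 0.22), (2, 0.78)],
    [(0, 0.08), (1, 0.92)],
    [(1, 0.34), (2, 0.66)],
    [(1, 0.87), (3, 0.13)],
    [(0, 0.37), (2, 0.34), (3, 0.28), (4, 0.01)],
]
num_topics = 5  # with a real model this would be model.num_topics

# dict(doc) maps topic_id -> probability, so each probability lands in the
# column named after its topic. reindex guarantees one column per topic in
# order, and fillna(0.0) fills in topics that a document did not report.
doc_topic_matrix = (
    pd.DataFrame([dict(doc) for doc in doc_topics])
    .reindex(columns=list(range(num_topics)))
    .fillna(0.0)
)

print(doc_topic_matrix)
```

The original one-liner misaligns because pandas fills columns by position in each list, not by topic id, so converting each document to a dict keyed by topic id is what fixes the alignment.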