python nlp gensim lda topic-modeling

Sorted document topic matrix gensim LDA


I ran LDA on a corpus using gensim, and I'm trying to get a matrix in which rows are documents and columns are topics. I used the line of code below, but in the output the scores don't correspond to the columns. I want column 0 to hold only the probability of topic 0, and likewise for columns 1, 2, etc.

Does anyone know how to do this?

DocTopMat = pd.DataFrame(model.get_document_topics(corpus),columns=[i for i in range(model.num_topics)])



Solution

  • I'm assuming each cell currently holds a (topic, probability) tuple. If so, try this to move each probability into the column for its topic:

    import pandas as pd
    import numpy as np

    # Sample data: each cell is a (topic_id, probability) tuple, or NaN if empty
    df = pd.DataFrame({0:[(1,0.22),(0,0.08),(1,0.34),(1,0.87),(0,0.37)],
                       1:[(2,0.78),(1,0.92),(2,0.66),(3,0.13),(2,0.34)],
                       2:[np.nan,np.nan,np.nan,np.nan,(3,0.28)],
                       3:[np.nan,np.nan,np.nan,np.nan,np.nan],
                       4:[np.nan,np.nan,np.nan,np.nan,(4,0.01)]})

    raw_cols = [0, 1, 2, 3, 4]

    # For each topic, sum the probabilities of the tuples whose id matches it.
    # NaN cells are skipped; a topic that is absent from a row gets 0.
    for topic in range(5):
        df[f"topic_{topic}"] = df[raw_cols].apply(
            lambda row: sum(cell[1] for cell in row
                            if isinstance(cell, tuple) and cell[0] == topic),
            axis=1)
    

    Output of df: each topic_N column holds the probability of topic N for that row, or 0 where the topic does not appear in the row.
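The same result can probably be reached without the intermediate DataFrame of tuples, by filling a fixed-width row per document straight from the (topic, probability) lists. What follows is only a minimal sketch: `to_dense_rows` and `sample` are hypothetical names, and `sample` is made-up data shaped like what `model.get_document_topics(corpus)` yields.

```python
def to_dense_rows(doc_topics, num_topics):
    """Turn per-document [(topic_id, prob), ...] lists into fixed-width rows."""
    rows = []
    for pairs in doc_topics:
        row = [0.0] * num_topics          # topics missing from a document stay at 0.0
        for topic_id, prob in pairs:
            row[topic_id] = float(prob)   # place each score in its own topic's column
        rows.append(row)
    return rows


# Hypothetical sample; with a real model this would be the iterable
# returned by model.get_document_topics(corpus).
sample = [
    [(1, 0.22), (2, 0.78)],
    [(0, 0.08), (1, 0.92)],
    [(0, 0.37), (2, 0.34), (3, 0.28), (4, 0.01)],
]

dense = to_dense_rows(sample, num_topics=5)
```

With a trained model, something like `pd.DataFrame(to_dense_rows(model.get_document_topics(corpus), model.num_topics))` should then give one column per topic. As far as I know, gensim leaves out topics whose probability falls below its `minimum_probability` threshold, which would explain why the tuples don't line up with the columns in the first place. Passing `minimum_probability=0` to `get_document_topics` reduces the number of omitted topics, but filling the gaps with 0 is still the safer option.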