python sorting tuples lda

Extract Topic Scores for Documents LDA Gensim Python problem of sorting tuples


The question https://stackoverflow.com/questions/70295773/extract-topic-scores-for-documents-lda-gensim-python is not similar to mine, and I have already tried a lot. I am trying to extract topic scores for the documents in my dataset after fitting an LDA model. Specifically, I have followed most of the code from here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

Running the code below raises this error:

TypeError: '<' not supported between instances of 'tuple' and 'int'

The code, which finds the dominant topic for each document:

def format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data):
    # Init output
    sent_topics_df = pd.DataFrame()
    # Get main topic in each document
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4), topic_keywords]), ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']

    # Add original text to the end of the output

    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return(sent_topics_df)


df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']

# Show
df_dominant_topic.head(10)

I tried to solve this but had no luck. First I tried this:

row = sorted(list(row), key=lambda x: (x[1]), reverse=True)

Then I tried:

sorted(row[0],reverse=True)

This leads to another problem, related to the pandas version: df.append is deprecated. I solved that part using pd.concat(), but the sort function still has me stuck. The pandas problem only showed up after I used the second sort above, which is wrong. Any help would be appreciated.
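For reference, here is a minimal sketch of the df.append replacement mentioned above. The rows are made-up values and the variable names are hypothetical; only pd.DataFrame and pd.concat are real pandas calls. It assumes the per-document results can be collected as plain tuples first.

```python
import pandas as pd

# Hypothetical per-document results: (topic id, contribution, keywords).
rows = [
    (1, 0.8012, "model, topic, word"),
    (0, 0.6543, "data, frame, column"),
]
columns = ["Dominant_Topic", "Perc_Contribution", "Topic_Keywords"]

# Option 1: collect tuples in a list and build the DataFrame once.
built_once = pd.DataFrame(rows, columns=columns)

# Option 2: build one-row frames and join them with a single pd.concat call.
frames = [pd.DataFrame([row], columns=columns) for row in rows]
concatenated = pd.concat(frames, ignore_index=True)

print(built_once)
print(concatenated)
```

Building the frame once is usually the simpler choice, because it avoids copying the DataFrame on every loop iteration.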


Solution

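  • This solution resolves both the sorting error and the DataFrame.append problem. The key change for the sort is taking row_list[0] when ldamodel.per_word_topics is set, because each item of ldamodel[corpus] is then a tuple whose first element is the topic distribution. If anyone is following the tutorial linked above and runs into either issue, the code below fixes both.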

    def format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data1):
        # Init output
        final = []
        # Get main topic in each document
        for i, row_list in enumerate(ldamodel[corpus]):
            row = row_list[0] if ldamodel.per_word_topics else row_list
            row = sorted(row, key=lambda x: (x[1]),reverse=True)
            # Get the Dominant topic, Perc Contribution and Keywords for each document
            for j, (topic_num, prop_topic) in enumerate(row):
                if j == 0:  # => dominant topic
                    wp = ldamodel.show_topic(topic_num)
                    topic_keywords = ", ".join([word for word, prop in wp])
                    lists1 = int(topic_num), round(prop_topic,4),topic_keywords
                    final.append(lists1)
                else:
                    break
        sent_topics_df = pd.DataFrame(final, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])
        contents = pd.Series(texts)
        sent_topics_df = pd.concat([sent_topics_df,contents], axis=1)
    
    
        return(sent_topics_df)
    
    
    df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=texts)
    
    # Format
    df_dominant_topic = df_topic_sents_keywords.reset_index()
    df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']
    
    # Show
    df_dominant_topic.head(10)
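
To show why sorting the raw row fails, here is a small sketch that does not need gensim. The row_list below is a hypothetical stand-in with made-up numbers; it only mimics the shape gensim returns for one document when the model was created with per_word_topics=True, namely (topic distribution, word topics, phi values). The helper name dominant_topic is also just for illustration.

```python
# Hypothetical stand-in for one item of ldamodel[corpus] when the model was
# created with per_word_topics=True. The numbers are made up for illustration.
row_list = (
    [(0, 0.15), (1, 0.80), (2, 0.05)],                # (topic_id, probability)
    [(10, [1, 0]), (11, [1])],                        # (word_id, [topic_ids])
    [(10, [(1, 0.9), (0, 0.1)]), (11, [(1, 1.0)])],   # (word_id, [(topic_id, phi)])
)

# Sorting the whole tuple compares unrelated structures and raises TypeError.
try:
    sorted(row_list, key=lambda x: x[1], reverse=True)
    sort_failed = False
except TypeError:
    sort_failed = True


def dominant_topic(row_list, per_word_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    # Only the first element holds the (topic_id, probability) pairs.
    row = row_list[0] if per_word_topics else row_list
    return sorted(row, key=lambda x: x[1], reverse=True)[0]


print(sort_failed)
print(dominant_topic(row_list, per_word_topics=True))
print(dominant_topic(row_list[0], per_word_topics=False))
```

Since only the top entry is used, max(row, key=lambda x: x[1]) would do the same job without a full sort.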