python-3.x pandas tf-idf columnsorting tfidfvectorizer

Can we get column names sorted by their tf-idf values (where present) for each document?


I'm using sklearn's TfidfVectorizer. For each document, I'm trying to get the column names as a list, sorted by their tf-idf values in decreasing order. Only words with a nonzero tf-idf value should appear, so if a document consists entirely of stop words, its list should be empty.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

msg = ["My name is Venkatesh",
       "Trying to get the significant words for each vector",
       "I want to get the list of words name in the decreasing order of their tf-idf values for each vector",
       "is to my"]

stopwords=['is','to','my','the','for','in','of','i','their']

tfidf_vect = TfidfVectorizer(stop_words=stopwords)

tfidf_matrix=tfidf_vect.fit_transform(msg)

pd.DataFrame(tfidf_matrix.toarray(), 
                 columns=tfidf_vect.get_feature_names_out())


I want to generate a column holding the list of words in decreasing order of their tf-idf values. The column would look like this:

    ['venkatesh','name']
    ['significant','trying','vector','words','each','get']
    ['decreasing','idf','list','order','tf','values','want','each','get','name','vector','words']
    [] # empty list Since the document consists only stopwords

The above is the primary result I'm looking for. It would be great to also get, for each document, a sorted dict with the tf-idf values as keys and the list of words associated with each value as values.

So the result would look like this:

{'0.785288':['venkatesh'],'0.619130':['name']}
{'0.47212':['significant','trying'],'0.372225':['vector','words','each','get']}
{'0.314534':['decreasing','idf','list','order','tf','values','want'],'0.247983':['each','get','name','vector','words']}
{} # empty dict Since the document consists only stopwords

Solution

  • I think this code does what you want and avoids using pandas:

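    from itertools import groupby

    feature_names = tfidf_vect.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
    sort_func = lambda v: v[0]  # sort and group by the tf-idf value, the first item in each tuple
    all_dicts = []
    for row in tfidf_matrix.toarray():
        # Pair each score with its word, highest score first
        sorted_vals = sorted(zip(row, feature_names), key=sort_func, reverse=True)
        # Group words that share a score, skipping zero scores (words absent from the document)
        all_dicts.append({val: [g[1] for g in group] for val, group in groupby(sorted_vals, key=sort_func) if val != 0})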
        
    

    You could make it even less readable and put it all in a single comprehension! :-)
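
For the primary result in the question (the plain list of words per document), a variation on the same idea might look like the sketch below. `ranked_terms` is a hypothetical helper name, and the scores in the example call are made-up values for illustration, not real output from the vectorizer. It takes one row of scores plus the matching feature names, so it should work on each row of `tfidf_matrix.toarray()` together with `tfidf_vect.get_feature_names_out()`.

```python
from itertools import groupby


def ranked_terms(row, names):
    """Return (words, groups) for one document's tf-idf row.

    words  -- words with a nonzero score, highest score first
    groups -- dict mapping each nonzero score to the words sharing it
    """
    by_score = lambda pair: pair[0]
    # Keep only words that actually occur in the document (score > 0).
    pairs = [(score, name) for score, name in zip(row, names) if score > 0]
    # sorted() is stable even with reverse=True, so tied words keep
    # the order they had in `names`.
    pairs.sort(key=by_score, reverse=True)
    words = [name for _, name in pairs]
    groups = {score: [name for _, name in grp]
              for score, grp in groupby(pairs, key=by_score)}
    return words, groups


# Illustrative scores only (assumed values, not vectorizer output).
words, groups = ranked_terms([0.5, 0.75, 0.0], ["name", "venkatesh", "get"])
print(words)   # ['venkatesh', 'name']
print(groups)  # {0.75: ['venkatesh'], 0.5: ['name']}
```

A document made up only of stop words has an all-zero row, so both the list and the dict come back empty. The dict keys here are the raw float scores; if string keys like those in the question are wanted, they would need to be formatted separately.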