Tags: python, python-3.x, tf-idf, tfidfvectorizer, keyword-extraction

How to extract keywords using TF-IDF for each row in Python?


I have a DataFrame column that contains only text. I need to extract the top keywords from each row using TF-IDF.

Example Input:

df['Text']
'I live in India',
'My favourite colour is Red', 
'I Love Programming'

Expected output:

df['Text']                         df['Keywords']
'I live in India'                  'live','India'
'My favourite colour is Red'       'favourite','colour','red'
'I Love Programming'               'love','programming'

How do I get this? I tried the code below:

tfidf = TfidfVectorizer(max_features=300, ngram_range = (2,2))
Y = df['Text'].apply(lambda x: tfidf.fit_transform(x))

I am getting the following error: Iterable over raw text documents expected, string object received.


Solution

  • Try the code below if you want to tokenize your sentences and drop stop words (note that it does not compute TF-IDF scores):

    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    
    # Requires NLTK's tokenizer models and 'stopwords' corpus (install once via nltk.download)
    df = pd.DataFrame({'Text':['I live in India', 'My favourite colour is Red', 'I Love Programming']})
    # Split each sentence into a list of tokens
    df['Keywords'] = df['Text'].apply(word_tokenize)
    # Drop English stop words (case-insensitive), then join the remaining tokens
    stops = set(stopwords.words('english'))
    df['Keywords'] = df['Keywords'].apply(lambda x: [item for item in x if item.lower() not in stops])
    df['Keywords'] = df['Keywords'].apply(', '.join)
    
    print(df)
    
                             Text                Keywords
    0             I live in India             live, India
    1  My favourite colour is Red  favourite, colour, Red
    2          I Love Programming       Love, Programming