I have a column that contains only text. I need to extract the top keywords from each row using TF-IDF.
Example Input:
df['Text']
'I live in India',
'My favourite colour is Red',
'I Love Programming'
Expected output:
df['Text']                      df['Keywords']
'I live in India'               'live','India'
'My favourite colour is Red'    'favourite','colour','red'
'I Love Programming'            'love','programming'
How do I get this? I tried writing the code below:
tfidf = TfidfVectorizer(max_features=300, ngram_range = (2,2))
Y = df['Text'].apply(lambda x: tfidf.fit_transform(x))
I am getting the error below: Iterable over raw text documents expected, string object received.
Try the code below if you want to tokenize your sentences and drop stop words:
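import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires the NLTK tokenizer and stop-word data (install once with nltk.download)
df = pd.DataFrame({'Text':['I live in India', 'My favourite colour is Red', 'I Love Programming']})
# Split each sentence into tokens
df['Keywords'] = df['Text'].apply(word_tokenize)
# Drop English stop words, comparing case-insensitively
stops = set(stopwords.words('english'))
df['Keywords'] = df['Keywords'].apply(lambda x: [item for item in x if item.lower() not in stops])
# Join the remaining tokens into one comma-separated string per row
df['Keywords'] = df['Keywords'].apply(', '.join)
print(df)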
Output:
                         Text                Keywords
0             I live in India             live, India
1  My favourite colour is Red  favourite, colour, Red
2          I Love Programming       Love, Programming