I'd like to search for a list of keywords in a text column and select all rows where the exact keywords exist. I know this question has many duplicates, but I can't understand why the solution is not working in my case.
keywords = ['fake', 'false', 'lie']
df1:
 | text |
---|---|
19152 | I think she is the Corona Virus.... |
19154 | Boy you hate to see that. I mean seeing how it was contained and all. |
19155 | Tell her it’s just the fake flu, it will go away in a few days. |
19235 | Is this fake news? |
... | ... |
20540 | She’ll believe it’s just alternative facts. |
Expected results: I'd like to select rows that contain the exact keywords in my list ('fake', 'false', 'lie'). For example, in the above df, it should return rows 19155 and 19235.
str.contains()
df1[df1['text'].str.contains("|".join(keywords))]
The problem with str.contains() is that the result is not limited to the exact keywords. For example, it returns sentences with believe (e.g., row 20540) because lie is a substring of "believe"!
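One way around the substring problem is to wrap the alternation in regex word boundaries (\b), so that lie only matches as a whole word. A minimal sketch, assuming a small stand-in frame (df_demo is an illustrative name, and it holds only a subset of the rows shown above):

```python
import re
import pandas as pd

keywords = ['fake', 'false', 'lie']

# Illustrative subset of the rows from the question
df_demo = pd.DataFrame(
    {"text": [
        "I think she is the Corona Virus....",
        "Tell her it’s just the fake flu, it will go away in a few days.",
        "Is this fake news?",
        "She’ll believe it’s just alternative facts.",
    ]},
    index=[19152, 19155, 19235, 20540],
)

# \b anchors the match at word boundaries; re.escape guards against
# keywords that contain regex metacharacters. The group is non-capturing
# so pandas does not warn about match groups.
pattern = r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b"

matches = df_demo[df_demo["text"].str.contains(pattern, case=False, regex=True)]
print(matches)
```

With this pattern, "believe" no longer matches, because the "lie" inside it is not surrounded by word boundaries. The case=False flag is an extra assumption here; drop it if the match should be case-sensitive.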
pandas.Series.isin
To find the rows including the exact keywords, I used pd.Series.isin:
df1[df1.text.isin(keywords)]
#df1[df1['text'].isin(keywords)]
Even though I see there are matches in df1, it doesn't return anything.
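The likely reason is that isin checks whether the whole cell value equals one of the keywords, not whether the cell contains one of them. A small illustration with made-up values:

```python
import pandas as pd

keywords = ['fake', 'false', 'lie']

# Made-up values: a full sentence and a cell that is exactly one keyword
s = pd.Series(["Is this fake news?", "fake"])

# isin compares each entire value against the list, so only the
# second element is flagged
mask = s.isin(keywords)
print(mask.tolist())
```

Since every cell in df1.text is a full sentence, none of them is equal to a keyword, and the filter comes back empty.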
If text is as follows,
import re
import pandas as pd

df1 = pd.DataFrame()
df1['text'] = [
"Dear Kellyanne, Please seek the help of Paula White I believe ...",
"trump saying it was under controll was a lie, ...",
"Her mouth should hanve been ... All the lies she has told ...",
"she'll believe ...",
"I do believe in ...",
"This value is false ...",
"This value is fake ...",
"This song is fakelove ..."
]
keywords = ['misleading', 'fake', 'false', 'lie']
First, a simple way is this:
df1[df1.text.apply(lambda x: pd.Series(x.split()).isin(keywords).any())]
text
5 This value is false ...
6 This value is fake ...
This won't match words like "believe", but it also misses "lie," because of the trailing punctuation.
Second, you can strip the special characters from the text first, like this:
new_text = df1.text.apply(lambda x: re.sub("[^0-9a-zA-Z]+", " ", x))
df1[new_text.apply(lambda x: pd.Series(x.split()).isin(keywords).any())]
Now it catches the word "lie,".
text
1 trump saying it was under controll was a lie, ...
5 This value is false ...
6 This value is fake ...
Third, this still can't catch the word "lies". That can be solved by using a library that reduces different forms of a word to the same base form (stemming or lemmatization). You can find how to tokenize here (tokenize-words-in-a-list-of-sentences-python)
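To make the idea concrete without pulling in an NLP library, here is a rough sketch. Note that crude_stem is a hypothetical stand-in that only strips a plural "s"; it is not a real stemmer, and a proper stemming or lemmatization library would handle far more cases. The helper names and the sample rows are illustrative.

```python
import re
import pandas as pd

keywords = {'misleading', 'fake', 'false', 'lie'}

def crude_stem(word):
    # Hypothetical stand-in for a real stemmer: strips one trailing "s"
    # from words longer than three letters, nothing else
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def has_keyword(text):
    # Lowercase, split into alphanumeric tokens, then compare stems
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return any(crude_stem(t) in keywords for t in tokens)

# Illustrative rows modeled on the sample text
df_demo = pd.DataFrame({"text": [
    "was a lie, ...",
    "All the lies she has told ...",
    "she'll believe ...",
    "This song is fakelove ...",
]})

result = df_demo[df_demo["text"].apply(has_keyword)]
print(result)
```

Here "lies" is reduced to "lie" and matches, while "believe" and "fakelove" still do not.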