python pandas match isin

Pandas isin() does not return anything even when the keywords exist in the dataframe


I'd like to search for a list of keywords in a text column and select all rows where the exact keywords exist. I know this question has many duplicates, but I can't understand why the solution is not working in my case.

keywords = ['fake', 'false', 'lie']

df1:

text
19152 I think she is the Corona Virus....
19154 Boy you hate to see that. I mean seeing how it was contained and all.
19155 Tell her it’s just the fake flu, it will go away in a few days.
19235 Is this fake news?
... ...
20540 She’ll believe it’s just alternative facts.

Expected results: I'd like to select the rows that contain any of the exact keywords in my list ('fake', 'false', 'lie'). For example, in the df above, it should return rows 19155 and 19235.

str.contains()

df1[df1['text'].str.contains("|".join(keywords))]

The problem with str.contains() is that the result is not limited to the exact keywords. For example, it returns sentences with believe (e.g., row 20540) because lie is a substring of "believe"!

pandas.Series.isin

To find the rows including the exact keywords, I used pd.Series.isin:

df1[df1.text.isin(keywords)]
#df1[df1['text'].isin(keywords)]

Even though I see there are matches in df1, it doesn't return anything.


Solution
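
As background for why the isin attempt returns nothing: Series.isin checks whether each whole cell value equals one of the list elements; it does not look inside the string. A minimal sketch with a made-up three-row Series (the sample strings are illustrative, not the asker's data):

```python
import pandas as pd

keywords = ['fake', 'false', 'lie']

# Hypothetical sample data: only the second cell is exactly a keyword.
s = pd.Series(['Is this fake news?', 'fake', 'She will believe it'])

# isin() compares each entire cell against the list elements,
# so a sentence that merely contains 'fake' does not match.
whole_cell_match = s.isin(keywords)
print(whole_cell_match.tolist())
```

Only the cell that is exactly 'fake' matches, so the text has to be broken into words (or searched with a pattern) before matching.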

  • Suppose the text looks like this:

    import pandas as pd

    df1 = pd.DataFrame()
    df1['text'] = [
        "Dear Kellyanne, Please seek the help of Paula White I believe ...",
        "trump saying it was under controll was a lie, ...",
        "Her mouth should hanve been ... All the lies she has told ...",
        "she'll believe ...",
        "I do believe in ...",
        "This value is false ...",
        "This value is fake ...",
        "This song is fakelove ..."
    ]
    keywords = ['misleading', 'fake', 'false', 'lie']
    

    First, a simple approach is to split each text into words and check whether any of them is in keywords:

    df1[df1.text.apply(lambda x: pd.Series(x.split()).isin(keywords).any())]
    
                          text
    5  This value is false ...
    6   This value is fake ...
    

    This does not match words like "believe", but it also misses "lie," because the trailing comma stays attached to the word.

    Second, replace the special characters in the text with spaces before splitting:

    import re

    new_text = df1.text.apply(lambda x: re.sub("[^0-9a-zA-Z]+", " ", x))
    df1[new_text.apply(lambda x: pd.Series(x.split()).isin(keywords).any())]
    

    Now it catches the word "lie,".

                                                    text
    1  trump saying it was under controll was a lie, ...
    5                            This value is false ...
    6                             This value is fake ...
    

    Third, this still does not catch the word "lies". That can be solved with a library that reduces different forms of a word to the same base form (stemming or lemmatization). You can find how to tokenize here(tokenize-words-in-a-list-of-sentences-python
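
As an alternative to splitting and cleaning the text by hand, str.contains can be kept if the pattern is wrapped in word boundaries (\b). This is a sketch, not the method used in the answer above; the shortened sample rows and the names pattern, mask and matched are chosen here for illustration.

```python
import re
import pandas as pd

keywords = ['misleading', 'fake', 'false', 'lie']

# A shortened version of the sample data, for illustration only.
df1 = pd.DataFrame({'text': [
    "I do believe in ...",
    "trump saying it was under controll was a lie, ...",
    "All the lies she has told ...",
    "This value is fake ...",
    "This song is fakelove ...",
]})

# \b matches at a word boundary, so 'lie' inside 'believe' is rejected.
# re.escape protects against keywords containing regex metacharacters.
# The group is non-capturing (?:...) so pandas does not warn about match groups.
pattern = r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b'

mask = df1['text'].str.contains(pattern, regex=True)
matched = df1[mask]
print(matched)
```

With this pattern "lie," matches because the comma is a word boundary, while "believe", "fakelove" and "lies" do not. Matching "lies" still needs stemming or lemmatization.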