python pandas string list match

Check if a string contains all words in a phrase from a list in Python


I have a list of phrases and I need to be able to identify whether each row in a dataset contains all the words from any of the phrases in my list. Take my example problem below.

I have a dataset where the column "Search" contains some browser searches. I also have a list called "phrases" that contains the phrases I'm trying to find within the Search column.

import pandas as pd
import numpy as np

text = [('how to screenshot on mac', 0),
        ('how to take screenshot?', 0),
        ('how to take screenshot on windows', 0),
        ('when is christmas', 0),
        ('how many days until christmas', 0),
        ('how many weeks until christmas', 0),
        ('how much is the new google pixel 8', 0),
        ('which google pixel versions are available', 0),
        ('how do I do google search on my pixel phone 7a', 0)]
labels = ['Search','Random_Column']
df = pd.DataFrame.from_records(text, columns=labels)

phrases = ['mac screenshot', 'days until christmas', 'google pixel 7a']

I don't care about the order of the words within "phrases", and there can be other words before, within, and after the phrase, but I need to make sure that only the df rows that contain all the words of at least one phrase are identified. Therefore, the expected output would be like this:

                                           Search  Random_Column  Match
0                        how to screenshot on mac              0   True
1                         how to take screenshot?              0  False
2               how to take screenshot on windows              0  False
3                               when is christmas              0  False
4                   how many days until christmas              0   True
5                  how many weeks until christmas              0  False
6              how much is the new google pixel 8              0  False
7       which google pixel versions are available              0  False
8  how do I do google search on my pixel phone 7a              0   True

I have found a lot of solutions for instances where the "phrases" list is made up of single words (e.g. here, here, and here) but I'm struggling to find a solution where I need to match full phrases.

I also tried to implement this solution but could not get it to work for a dataset.


Solution

  • You have to loop over the phrases until you find a match. An efficient option is to use sets (set.issubset) combined with any:

    # convert the phrases to set
    sets = [set(s.split()) for s in phrases]
    # [{'screenshot', 'mac'}, {'until', 'christmas', 'days'},
    #  {'google', 'pixel', '7a'}]
    
    # for each string, check if one of the sets is a subset
    # if a match is found, return True immediately
    df['Match'] = [any(S.issubset(lst) for S in sets)
                   for lst in map(str.split, df['Search'])]
    

    Output:

                                               Search  Random_Column  Match
    0                        how to screenshot on mac              0   True
    1                         how to take screenshot?              0  False
    2               how to take screenshot on windows              0  False
    3                               when is christmas              0  False
    4                   how many days until christmas              0   True
    5                  how many weeks until christmas              0  False
    6              how much is the new google pixel 8              0  False
    7       which google pixel versions are available              0  False
    8  how do I do google search on my pixel phone 7a              0   True
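
Note that str.split only splits on whitespace, so a token such as 'screenshot?' keeps its question mark and would not match the word 'screenshot', and 'Mac' would not match 'mac'. If that matters for your data, a minimal sketch of a normalized variant follows. It assumes that lowercasing and extracting \w+ tokens is acceptable for your searches; the helper names tokenize and matches_any_phrase are made up for this illustration and are not part of pandas.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens as a set (punctuation dropped)."""
    return set(re.findall(r"\w+", text.lower()))

def matches_any_phrase(text, phrase_sets):
    """Return True if every word of at least one phrase occurs in the text."""
    words = tokenize(text)
    # `phrase <= words` is the operator form of phrase.issubset(words)
    return any(phrase <= words for phrase in phrase_sets)

phrases = ['mac screenshot', 'days until christmas', 'google pixel 7a']
phrase_sets = [tokenize(p) for p in phrases]

# Hypothetical searches with mixed case and punctuation
searches = ['How to take a Screenshot on Mac?',
            'how to take screenshot?',
            'Days until Christmas!']
result = [matches_any_phrase(s, phrase_sets) for s in searches]
print(result)  # [True, False, True]
```

With the DataFrame from the question, the same helper could be applied with something like df['Match'] = df['Search'].map(lambda s: matches_any_phrase(s, phrase_sets)).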