pythonpandasdictionarytextmatching

Exact Match Words 'Keys' in a Dictionary to Pandas DataFrame Column & return appropriate value


Just want to preface that this question evolved from a previous question of mine which can be found here. I had a few follow-up's that ultimately changed the original question so here we are..

Assume we have the following dataframe:

d = {'keywords' :['cheapest cheap shoes', 'luxury shoes', 'cheap hiking shoes','liverpool']}
keywords = pd.DataFrame(d,columns=['keywords'])
In [7]: keywords
Out[7]:
    keywords
0  cheapest cheap shoes
1  luxury shoes
2  cheap hiking shoes
3  liverpool

A dictionary is then created which houses keywords that I would like to match to the values within the DataFrame

labels = {'cheape' : 'budget', 'cheap' : 'budget', 'luxury' : 'expensive', 
'hiking' : 'sport', 'pool': 'pool'}

The original answer that was provided to me helped solved the issue with matching keys in the dictionary

d = {'keywords' :['cheapest cheap shoes', 'luxury shoes', 'cheap hiking 
shoes','liverpool']}

keywords = pd.DataFrame(d,columns=['keywords'])

labels = {'cheape' : 'budget', 'cheap' : 'budget', 'luxury' : 
'expensive','hiking' : 'sport', 'pool': 'pool'}

df = pd.DataFrame(d)

def matcher(k):
    x = (i for i in labels if i in k)
    return ' | '.join(map(labels.get, x))

df['values'] = df['keywords'].map(matcher)

                keywords    values
0   cheapest cheap shoes    budget | budget
1   luxury shoes            expensive
2   cheap hiking shoes      budget | sport
3   liverpool               pool

However, I was running into an issue with matches resulting from partial matches. In the output above note how cheape will match to "cheapest" and pool will match to "liverpool"

So my question is: Is there a way for me to exact match my dictionary to the values in keyword so that the partial matches are skipped?

My desired outcome would be:

                keywords    values
0   cheapest cheap shoes    budget
1   luxury shoes            expensive
2   cheap hiking shoes      budget | sport
3   liverpool               N/A   

Side note - the dictionary will expand to include keys that are tied to the same value. This is to capture any spelling variations or misspellings e.g. {'car' : 'Automobile', 'cars' : 'Automobile', 'carss' : 'Automobile'} which is why I'd like to exact match to prevent any duplicate / non-relevant values from showing up.

Cheers


Solution

  • Here is a solution in line with my first one. str.split(' ') splits a string by whitespace.

    import pandas as pd
    
    d = {'keywords' :['cheapest cheap shoes', 'luxury shoes',
                      'cheap hiking shoes', 'liverpool']}
    
    keywords = pd.DataFrame(d, columns=['keywords'])
    
    labels = {'cheape': 'budget', 'cheap': 'budget', 'luxury': 'expensive',
              'hiking': 'sport', 'pool':'pool'}
    
    df = pd.DataFrame(d)
    
    def matcher(k):
        x = (i for i in labels if i in k.split(' '))
        return ' | '.join(map(labels.get, x))
    
    df['values'] = df['keywords'].map(matcher)
    

    Result

                   keywords          values
    0  cheapest cheap shoes          budget
    1          luxury shoes       expensive
    2    cheap hiking shoes  budget | sport
    3             liverpool