Just want to preface that this question evolved from a previous question of mine which can be found here. I had a few follow-up's that ultimately changed the original question so here we are..
Assume we have the following dataframe:
d = {'keywords' :['cheapest cheap shoes', 'luxury shoes', 'cheap hiking shoes','liverpool']}
keywords = pd.DataFrame(d,columns=['keywords'])
In [7]: keywords
Out[7]:
keywords
0 cheapest cheap shoes
1 luxury shoes
2 cheap hiking shoes
3 liverpool
A dictionary is then created which houses keywords that I would like to match to the values within the DataFrame
labels = {'cheape' : 'budget', 'cheap' : 'budget', 'luxury' : 'expensive',
'hiking' : 'sport', 'pool': 'pool'}
The original answer that was provided to me helped solved the issue with matching keys in the dictionary
d = {'keywords' :['cheapest cheap shoes', 'luxury shoes', 'cheap hiking
shoes','liverpool']}
keywords = pd.DataFrame(d,columns=['keywords'])
labels = {'cheape' : 'budget', 'cheap' : 'budget', 'luxury' :
'expensive','hiking' : 'sport', 'pool': 'pool'}
df = pd.DataFrame(d)
def matcher(k):
x = (i for i in labels if i in k)
return ' | '.join(map(labels.get, x))
df['values'] = df['keywords'].map(matcher)
keywords values
0 cheapest cheap shoes budget | budget
1 luxury shoes expensive
2 cheap hiking shoes budget | sport
3 liverpool pool
However, I was running into an issue with matches resulting from partial matches. In the output above note how cheape will match to "cheapest" and pool will match to "liverpool"
So my question is: Is there a way for me to exact match my dictionary to the values in keyword so that the partial matches are skipped?
My desired outcome would be:
keywords values
0 cheapest cheap shoes budget
1 luxury shoes expensive
2 cheap hiking shoes budget | sport
3 liverpool N/A
Side note - the dictionary will expand to include keys that are tied to the same value. This is to capture any spelling variations or misspellings e.g. {'car' : 'Automobile', 'cars' : 'Automobile', 'carss' : 'Automobile'}
which is why I'd like to exact match to prevent any duplicate / non-relevant values from showing up.
Cheers
Here is a solution in line with my first one. str.split(' ')
splits a string by whitespace.
import pandas as pd
d = {'keywords' :['cheapest cheap shoes', 'luxury shoes',
'cheap hiking shoes', 'liverpool']}
keywords = pd.DataFrame(d, columns=['keywords'])
labels = {'cheape': 'budget', 'cheap': 'budget', 'luxury': 'expensive',
'hiking': 'sport', 'pool':'pool'}
df = pd.DataFrame(d)
def matcher(k):
x = (i for i in labels if i in k.split(' '))
return ' | '.join(map(labels.get, x))
df['values'] = df['keywords'].map(matcher)
Result
keywords values
0 cheapest cheap shoes budget
1 luxury shoes expensive
2 cheap hiking shoes budget | sport
3 liverpool