I am trying to implement a code that reads a csv file, creates a data frame out of it, and then tags each row the name of the key, if one of the columns in that row contains the same string as in the key of the dataframe.
As an example, I have the following dictionary defined:
Sdiction={
"Mgage" : ["ABC Gage","XYZ Gage"],
"Rate" : ["deg/min","rad/s","rpm"]}
And I have the following dataframe:
Col A | Col B | Col C | Col D |
---|---|---|---|
1 | 30 | ABC Gage | |
2 | 45 | deg/min | |
3 | 150 | Gage |
I want to tag Col D for each row as
Row 1 - Col D = Mgage (since ABC Gage exists in the key Mgage)
Row 2 - Col D = Rate (Since deg/min exists in the key Rate)
Row 3 - Col D = Mgage (since the string Gage exists in the key Mgage, albeit partially)
Expected output:
Col A | Col B | Col C | Col D |
---|---|---|---|
1 | 30 | ABC Gage | Mgage |
2 | 45 | deg/min | Rate |
3 | 150 | Gage | Mgage |
I am trying to figure out how to implement this part, have not yet implemented it, and therefore need help.
Using a regex match:
import re
s = df['Col C'].str.casefold()
pattern = '(%s)' % '|'.join(map(re.escape, s))
# '(abc\\ gage|deg/min|gage)'
# reverse dictionary
tmp = pd.Series({v.casefold(): k for k, l in Sdiction.items()
for v in l}, name='ref').reset_index()
# extract first match, map reference key
df['Col D'] = s.map(tmp.assign(match=tmp['index'].str.extract(pattern))
.dropna(subset=['match'])
.set_index('match')['ref']
)
Output:
Col A Col B Col C Col D
0 1 30 ABC Gage Mgage
1 2 45 deg/min Rate
2 3 150 Gage Mgage