So I have a table where I have identified fuzzy matches and an amount. I want to be able to summarize the amount by this common key.
My Data looks like this:
Name | Match1 | Match2 | Amount |
---|---|---|---|
Jame | James | Jim | 1 |
Jame | James | Jim | 2 |
Mike | Mikes | Miike | 3 |
James | Jame | Jim | 4 |
Jim | Jame | James | 5 |
Mikes | Mike | Miike | 6 |
Miike | Mike | Mikes | 7 |
data = {'Name': {0: 'Jame',
1: 'Jame',
2: 'Mike',
3: 'James',
4: 'Jim',
5: 'Mikes',
6: 'Miike'},
'Match1': {0: 'James',
1: 'James',
2: 'Mikes',
3: 'Jame',
4: 'Jame',
5: 'Mike',
6: 'Mike'},
'Match2': {0: 'Jim',
1: 'Jim',
2: 'Miike',
3: 'Jim',
4: 'James',
5: 'Miike',
6: 'Mikes'},
'Amount': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7}}
df = pd.DataFrame.from_dict(data)
Key | Name | Match1 | Match2 | Amount |
---|---|---|---|---|
Jame | Jame | James | Jim | 1 |
Jame | Jame | James | Jim | 2 |
Mike | Mike | Mikes | Miike | 3 |
Jame | James | Jame | Jim | 4 |
Jame | Jim | Jame | James | 5 |
Mike | Mikes | Mike | Miike | 6 |
Mike | Miike | Mike | Mikes | 7 |
So I can summarize my data like this:
Name | Amount |
---|---|
Jame | 12 |
Mike | 16 |
If you initially have the list of keys for your dataframe, you can do this.
mylist = ['Jame','Mike']
df['Key'] = [j for i in df.to_numpy() for j in mylist if j in i]
df.set_index(df.columns[-1]).reset_index(inplace=True)
Key Name Match1 Match2 Amount
0 Jame Jame James Jim 1
1 Jame Jame James Jim 2
2 Mike Mike Mikes Miike 3
3 Jame James Jame Jim 4
4 Jame Jim Jame James 5
5 Mike Mikes Mike Miike 6
6 Mike Miike Mike Mikes 7
And then, you can do groupby.sum
to get your desired output.
df.groupby('Key')['Amount'].sum().reset_index()
Key Amount
0 Jame 12
1 Mike 16
To obtain the key through the dataframe, you can use the process.extractOne
in fuzzywuzzy
to compare the matching percentage of every element in a row. If it's less than the threshold, we will update the key.
from fuzzywuzzy import process
key = [df.Name[0]]
for row in df.to_numpy():
for i in row:
if 50 < process.extractOne(i, key)[1]:
break
else:
key.append(row[0])
break
key
Out[65]: ['Jame', 'Mike']