I am trying to do fuzzy match and grouping using Python on multiple fields. I want to do the comparison on each column on a different fuzzy threshold. I tried to search on google but could not find any solution which can do deduplication and then create groups on different columns.
Input:
Name | Address |
---|---|
Robert | 9185 Pumpkin Hill St. |
Rob | 9185 Pumpkin Hill Street |
Mike | 1296 Tunnel St. |
Mike | Tunnel Street 1296 |
John | 6200 Beechwood Drive |
Output:
Group ID | Name | Address |
---|---|---|
1 | Robert | 9185 Pumpkin Hill St. |
1 | Rob | 9185 Pumpkin Hill Street |
2 | Mike | 1296 Tunnel St. |
2 | Mike | Tunnel Street 1296 |
3 | John | 6200 Beechwood Drive |
I'd recommend reviewing Levenstein distance as this is a common algorithm to identify similar strings. Library FuzzWuzzy(goofy name I know) implements it with 3 different approaches. See this article for more info
Here's a starting place that compares each string against every other string. You mention having different thresholds, so all would need to do is loop through l_match
and group them depending on your desired thresholds
#Run this to install the required libraries
#pip install python-levenshtein fuzzywuzzy
from fuzzywuzzy import fuzz
l_data =[
['Robert','9185 Pumpkin Hill St.']
,['Rob','9185 Pumpkin Hill Street']
,['Mike','1296 Tunnel St.']
,['Mike','Tunnel Street 1296']
,['John','6200 Beechwood Drive']
]
l_match = []
#loop through data
for idx1,row1 in enumerate(l_data):
#compare each person with every person that comes after later in the list (so compares only 1 way instead of comparing A vs B and B vs A)
for idx2,row2 in enumerate(l_data[idx1+1:]):
#Calculates index in original array for row2
origIdx=idx1+idx2+1
l_match.append([idx1,origIdx,fuzz.ratio(row1[0],row2[0]),fuzz.ratio(row1[1],row2[1])])
#Print raw data with index
for idx,val in enumerate(l_data):
print(f'{idx}-{val}')
print ("*" * 100)
#Print results of comparison
for row in l_match:
id1 = row[0]
id2 = row[1]
formattedName1 = f'{id1}-{l_data[id1][0]}'
formattedName2 = f'{id2}-{l_data[id2][0]}'
print (f'{formattedName1} and {formattedName2} have {row[2]}% name similarity ratio and {row[3]}% address similarity ratio')
0-['Robert', '9185 Pumpkin Hill St.']
1-['Rob', '9185 Pumpkin Hill Street']
2-['Mike', '1296 Tunnel St.']
3-['Mike', 'Tunnel Street 1296']
4-['John', '6200 Beechwood Drive']
****************************************************************************************************
0-Robert and 1-Rob have 67% name similarity ratio and 89% address similarity ratio
0-Robert and 2-Mike have 20% name similarity ratio and 50% address similarity ratio
0-Robert and 3-Mike have 20% name similarity ratio and 31% address similarity ratio
0-Robert and 4-John have 20% name similarity ratio and 15% address similarity ratio
1-Rob and 2-Mike have 0% name similarity ratio and 41% address similarity ratio
1-Rob and 3-Mike have 0% name similarity ratio and 48% address similarity ratio
1-Rob and 4-John have 29% name similarity ratio and 18% address similarity ratio
2-Mike and 3-Mike have 100% name similarity ratio and 55% address similarity ratio
2-Mike and 4-John have 0% name similarity ratio and 23% address similarity ratio
3-Mike and 4-John have 0% name similarity ratio and 21% address similarity ratio