[SOLVED] Fuzzy matching and grouping

Fuzzy matching and grouping

I am trying to do fuzzy match and grouping using Python on multiple fields. I want to do the comparison on each column on a different fuzzy threshold. I tried to search on google but could not find any solution which can do deduplication and then create groups on different columns.

Input:

Name	Address
Robert	9185 Pumpkin Hill St.
Rob	9185 Pumpkin Hill Street
Mike	1296 Tunnel St.
Mike	Tunnel Street 1296
John	6200 Beechwood Drive

Output:

Group ID	Name	Address
1	Robert	9185 Pumpkin Hill St.
1	Rob	9185 Pumpkin Hill Street
2	Mike	1296 Tunnel St.
2	Mike	Tunnel Street 1296
3	John	6200 Beechwood Drive

Solution

I'd recommend reviewing Levenstein distance as this is a common algorithm to identify similar strings. Library FuzzWuzzy(goofy name I know) implements it with 3 different approaches. See this article for more info

Here's a starting place that compares each string against every other string. You mention having different thresholds, so all would need to do is loop through l_match and group them depending on your desired thresholds


#Run this to install the required libraries
#pip install python-levenshtein fuzzywuzzy
from fuzzywuzzy import fuzz

l_data =[
     ['Robert','9185 Pumpkin Hill St.']
    ,['Rob','9185 Pumpkin Hill Street']
    ,['Mike','1296 Tunnel St.']
    ,['Mike','Tunnel Street 1296']
    ,['John','6200 Beechwood Drive']
]
l_match = []

#loop through data
for idx1,row1 in enumerate(l_data):
    #compare each person with every person that comes after later in the list (so compares only 1 way instead of comparing A vs B and B vs A)
    for idx2,row2 in enumerate(l_data[idx1+1:]):
        #Calculates index in original array for row2
        origIdx=idx1+idx2+1
        l_match.append([idx1,origIdx,fuzz.ratio(row1[0],row2[0]),fuzz.ratio(row1[1],row2[1])])

#Print raw data with index
for idx,val in enumerate(l_data):
    print(f'{idx}-{val}')
print ("*" * 100)

#Print results of comparison
for row in l_match:
    id1 = row[0]
    id2 = row[1]
    formattedName1 = f'{id1}-{l_data[id1][0]}'
    formattedName2 = f'{id2}-{l_data[id2][0]}'
    print (f'{formattedName1} and {formattedName2} have {row[2]}% name similarity ratio and {row[3]}% address similarity ratio')

Results:

0-['Robert', '9185 Pumpkin Hill St.']
1-['Rob', '9185 Pumpkin Hill Street']
2-['Mike', '1296 Tunnel St.']
3-['Mike', 'Tunnel Street 1296']
4-['John', '6200 Beechwood Drive']
****************************************************************************************************
0-Robert and 1-Rob have 67% name similarity ratio and 89% address similarity ratio
0-Robert and 2-Mike have 20% name similarity ratio and 50% address similarity ratio
0-Robert and 3-Mike have 20% name similarity ratio and 31% address similarity ratio
0-Robert and 4-John have 20% name similarity ratio and 15% address similarity ratio
1-Rob and 2-Mike have 0% name similarity ratio and 41% address similarity ratio
1-Rob and 3-Mike have 0% name similarity ratio and 48% address similarity ratio
1-Rob and 4-John have 29% name similarity ratio and 18% address similarity ratio
2-Mike and 3-Mike have 100% name similarity ratio and 55% address similarity ratio
2-Mike and 4-John have 0% name similarity ratio and 23% address similarity ratio
3-Mike and 4-John have 0% name similarity ratio and 21% address similarity ratio