pythonfuzzy-logicfuzzyfuzzy-comparison

Fuzzy matching and grouping


I am trying to do fuzzy match and grouping using Python on multiple fields. I want to do the comparison on each column on a different fuzzy threshold. I tried to search on google but could not find any solution which can do deduplication and then create groups on different columns.

Input:

Name Address
Robert 9185 Pumpkin Hill St.
Rob 9185 Pumpkin Hill Street
Mike 1296 Tunnel St.
Mike Tunnel Street 1296
John 6200 Beechwood Drive

Output:

Group ID Name Address
1 Robert 9185 Pumpkin Hill St.
1 Rob 9185 Pumpkin Hill Street
2 Mike 1296 Tunnel St.
2 Mike Tunnel Street 1296
3 John 6200 Beechwood Drive

Solution

  • I'd recommend reviewing Levenstein distance as this is a common algorithm to identify similar strings. Library FuzzWuzzy(goofy name I know) implements it with 3 different approaches. See this article for more info

    Here's a starting place that compares each string against every other string. You mention having different thresholds, so all would need to do is loop through l_match and group them depending on your desired thresholds

    
    #Run this to install the required libraries
    #pip install python-levenshtein fuzzywuzzy
    from fuzzywuzzy import fuzz
    
    l_data =[
         ['Robert','9185 Pumpkin Hill St.']
        ,['Rob','9185 Pumpkin Hill Street']
        ,['Mike','1296 Tunnel St.']
        ,['Mike','Tunnel Street 1296']
        ,['John','6200 Beechwood Drive']
    ]
    l_match = []
    
    #loop through data
    for idx1,row1 in enumerate(l_data):
        #compare each person with every person that comes after later in the list (so compares only 1 way instead of comparing A vs B and B vs A)
        for idx2,row2 in enumerate(l_data[idx1+1:]):
            #Calculates index in original array for row2
            origIdx=idx1+idx2+1
            l_match.append([idx1,origIdx,fuzz.ratio(row1[0],row2[0]),fuzz.ratio(row1[1],row2[1])])
    
    #Print raw data with index
    for idx,val in enumerate(l_data):
        print(f'{idx}-{val}')
    print ("*" * 100)
    
    #Print results of comparison
    for row in l_match:
        id1 = row[0]
        id2 = row[1]
        formattedName1 = f'{id1}-{l_data[id1][0]}'
        formattedName2 = f'{id2}-{l_data[id2][0]}'
        print (f'{formattedName1} and {formattedName2} have {row[2]}% name similarity ratio and {row[3]}% address similarity ratio')
    

    Results:

    0-['Robert', '9185 Pumpkin Hill St.']
    1-['Rob', '9185 Pumpkin Hill Street']
    2-['Mike', '1296 Tunnel St.']
    3-['Mike', 'Tunnel Street 1296']
    4-['John', '6200 Beechwood Drive']
    ****************************************************************************************************
    0-Robert and 1-Rob have 67% name similarity ratio and 89% address similarity ratio
    0-Robert and 2-Mike have 20% name similarity ratio and 50% address similarity ratio
    0-Robert and 3-Mike have 20% name similarity ratio and 31% address similarity ratio
    0-Robert and 4-John have 20% name similarity ratio and 15% address similarity ratio
    1-Rob and 2-Mike have 0% name similarity ratio and 41% address similarity ratio
    1-Rob and 3-Mike have 0% name similarity ratio and 48% address similarity ratio
    1-Rob and 4-John have 29% name similarity ratio and 18% address similarity ratio
    2-Mike and 3-Mike have 100% name similarity ratio and 55% address similarity ratio
    2-Mike and 4-John have 0% name similarity ratio and 23% address similarity ratio
    3-Mike and 4-John have 0% name similarity ratio and 21% address similarity ratio