[SOLVED] Python Record Linkage, Fuzzy Match and Deduplication

I have 3 datasets of customers, each with 7 columns:

CustomerName
Address
Phone
StoreName
Mobile
Longitude
Latitude

Each dataset has 13,000-18,000 records. I am trying to fuzzy match them for deduplication across the datasets. The columns should not all carry the same weight in the matching. How can I handle this? Do you know of a good library for my case?

Solution

I think the Recordlinkage library would suit your purposes.

You can use its Compare object and ask for a different kind of comparison on each column:

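import recordlinkage

compare_cl = recordlinkage.Compare()
# exact match on the name, fuzzy (thresholded) match on store name and address
compare_cl.exact('CustomerName', 'CustomerName', label='CustomerName')
compare_cl.string('StoreName', 'StoreName', method='jarowinkler', threshold=0.85, label='StoreName')
compare_cl.string('Address', 'Address', threshold=0.85, label='Address')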
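
To make the threshold argument a bit more concrete: as far as I understand it, a thresholded string comparison turns a similarity score into 1 (at or above the threshold) or 0 (below it). Here is a rough standard-library sketch of that idea. Note that it uses difflib's ratio as a stand-in, not the Jaro-Winkler measure from the snippet above, and thresholded_similarity is a made-up helper, not part of Recordlinkage.

```python
from difflib import SequenceMatcher


def thresholded_similarity(a: str, b: str, threshold: float = 0.85) -> int:
    """Return 1 if the two strings are similar enough, else 0.

    Assumption: difflib's ratio (a score in [0, 1]) is good enough to
    illustrate the threshold step; it is not Jaro-Winkler.
    """
    score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return 1 if score >= threshold else 0


# A one-letter typo still counts as a match; unrelated addresses do not.
same = thresholded_similarity("12 Main Street", "12 Main Stret")
different = thresholded_similarity("12 Main Street", "98 Oak Avenue")
```

So each comparison ends up contributing a 0 or a 1 per candidate pair, which is what makes summing the feature columns meaningful.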

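Then, when selecting the matches, you can customize what counts as a match, e.g. require at least 2 of the 3 features to agree (here pairs is the set of candidate record pairs from an indexing step and df is your DataFrame):

features = compare_cl.compute(pairs, df)
# each feature is 0 or 1, so the row sum counts how many features matched
matches = features[features.sum(axis=1) >= 2]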
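
Since you said the columns should not all weigh the same, one simple option is to replace the plain sum with a weighted sum. The sketch below is only an illustration on a hand-made table: the weights, the 0.6 cut-off and the toy features values are assumptions of mine, and in practice features would be the DataFrame returned by compare_cl.compute().

```python
import pandas as pd

# Toy comparison results: one row per candidate pair, 1 = field matched.
# In practice this would come from compare_cl.compute(pairs, df).
features = pd.DataFrame(
    {
        "CustomerName": [1, 0, 1],
        "StoreName": [1, 1, 0],
        "Address": [0, 1, 0],
    },
    index=pd.MultiIndex.from_tuples([(0, 5), (1, 7), (2, 9)]),
)

# Hypothetical weights: choose values that reflect how much you trust each column.
weights = pd.Series({"CustomerName": 0.5, "StoreName": 0.2, "Address": 0.3})

# Weighted sum per pair: scale each column by its weight, then add across columns.
scores = features.mul(weights, axis=1).sum(axis=1)

# Keep the pairs whose weighted score reaches the (assumed) cut-off.
weighted_matches = features[scores >= 0.6]
```

With these made-up numbers only the first pair is kept, because a name match plus a store match outweighs a store match plus an address match. You would have to tune the weights and the cut-off on your own data.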