I have 3 datasets of customers with 7 columns:
CustomerName
Address
Phone
StoreName
Mobile
Longitude
Latitude
Every dataset has 13,000-18,000 records. I am trying to fuzzy match between them for deduplication. The columns in my datasets do not all carry the same weight in this matching. How can I handle that? Do you know a good library for my case?
I think the recordlinkage library would suit your purposes.
You can use the Compare object to require various kinds of matches:
compare_cl.exact('CustomerName', 'CustomerName', label='CustomerName')
compare_cl.string('StoreName', 'StoreName', method='jarowinkler', threshold=0.85, label='StoreName')
compare_cl.string('Address', 'Address', threshold=0.85, label='Address')
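Those calls assume you have already created the Compare object and a set of candidate pairs. A minimal setup sketch, where df is a placeholder for one of your customer DataFrames (to link two of your three datasets instead, pass both frames to index() and compute()):

import recordlinkage

# candidate pairs: blocking on StoreName keeps the pair count manageable,
# but only works if StoreName is entered consistently; indexer.full()
# compares every record with every other one, which may still be feasible
# at 13,000-18,000 records
indexer = recordlinkage.Index()
indexer.block('StoreName')
pairs = indexer.index(df)

compare_cl = recordlinkage.Compare()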
Then, when computing the matches, you can customize how you want the results, e.g. requiring at least 2 features to match:
features = compare_cl.compute(pairs, df)
matches = features[features.sum(axis=1) >= 2]
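As for the columns not having the same weight: one simple approach (my own suggestion, not something specific to recordlinkage) is to weight the feature columns before summing. The weights and cut-off below are hypothetical; tune them on a sample of pairs you have checked by hand:

import pandas as pd

# hypothetical weights expressing how much you trust each column
weights = pd.Series({'CustomerName': 2.0, 'StoreName': 1.0, 'Address': 0.5})

# weighted score per candidate pair; the Series index aligns with the feature columns
score = (features * weights).sum(axis=1)
matches = features[score >= 2.5]  # hypothetical cut-off

recordlinkage also ships classifiers (e.g. ECMClassifier, which is unsupervised, or LogisticRegressionClassifier if you can label some pairs) that effectively learn per-column weights from the feature matrix, which may work better than hand-tuned weights.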