I'm using the Dedupe library to match person records to each other. My data includes first_name, last_name, email, phone1, phone2, phone3 and address information.
Here is my question: I always want to match two records with 80% to 99% confidence if they have a matching first_name and last_name plus a match on phone1, phone2, phone3, email or address. I also want to match phone numbers across fields, for example phone1 = phone2, phone1 = phone3 or phone2 = phone3.
Here is an example of some of my code:
fields = [
    {'field': 'first_name', 'variable name': 'ffname', 'type': 'Exact'},
    {'field': 'last_name', 'variable name': 'lname', 'type': 'Exact'},
    {'field': 'email', 'variable name': 'email', 'type': 'Exact', 'Has Missing': True},
    {'field': 'phone1', 'variable name': 'phone1', 'type': 'Exact', 'Has Missing': True},
    {'field': 'phone2', 'variable name': 'phone2', 'type': 'Exact', 'Has Missing': True},
    {'field': 'phone3', 'variable name': 'phone3', 'type': 'Exact', 'Has Missing': True},
    {'field': 'address', 'variable name': 'addr', 'type': 'String', 'Has Missing': True}
]
In the Dedupe library, is there any way for me to match phone numbers across fields together with first_name and last_name?
Looking at the documentation, there are two ways of doing that.
The first one is to use the Set variable type. The catch: Set is similar to Text in the way it compares values. It looks at common terms, so from that perspective the phone number (123) 456-7890 is not the same as 4567890.
The other alternative, which I believe is better, is to build a custom comparator. This comparator would take two lists of phone numbers and return a number. The lower the number, the more similar the two lists are. This comparator can be based on the affine gap distance, which is already used for String variables. Here's an implementation:
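from affinegap import normalizedAffineGapDistance as affineGap

def phonesComparator(f1, f2):
    # Compare every phone in the first list with every phone in the second.
    distances = []
    for p1 in f1:
        for p2 in f2:
            distances.append(affineGap(p1, p2))
    if distances:
        # The closest pair of numbers decides the score.
        return min(distances)
    else:
        # At least one list is empty: return a large distance.
        return 200.0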
Here I'm returning the minimum distance between any two phone numbers in the two lists, but one can of course come up with alternative measures.
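As one example of an alternative measure, here is a minimal sketch that strips formatting and then checks for an exact overlap between the two lists. The names normalize_phone and phones_overlap_comparator are made up for illustration, and returning 0.0 for a match and 1.0 otherwise is an assumption that simply follows the "lower is better" convention above.

```python
import re

def normalize_phone(phone):
    """Keep only the digits, e.g. '(123) 456-7890' -> '1234567890'."""
    return re.sub(r"\D", "", phone)

def phones_overlap_comparator(f1, f2):
    """Return 0.0 if the lists share a number after normalization, else 1.0."""
    # Drop entries that contain no digits at all.
    digits1 = {normalize_phone(p) for p in f1} - {""}
    digits2 = {normalize_phone(p) for p in f2} - {""}
    return 0.0 if digits1 & digits2 else 1.0
```

This treats (123) 456-7890 and 123.456.7890 as the same number, but it will still not match (123) 456-7890 against 4567890, because the area code is missing from the second one.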
One final note: when creating the records, one should place all the phones in a single field. That field should hold a list of phone numbers (or the empty list if there are none).
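A rough sketch of that last step, assuming each raw record is a plain dict with the phone1, phone2 and phone3 keys from the question. The helper name merge_phones and the new field name phones are hypothetical, and the sample values are invented.

```python
def merge_phones(record, keys=("phone1", "phone2", "phone3")):
    """Return a copy of the record with the phone columns collapsed into one list."""
    # Copy every non-phone field unchanged.
    merged = {k: v for k, v in record.items() if k not in keys}
    # Keep only phones that are present and non-empty; the result may be [].
    merged["phones"] = [record[k] for k in keys if record.get(k)]
    return merged

raw = {"first_name": "Ann", "last_name": "Lee", "email": None,
       "phone1": "555-0100", "phone2": "", "phone3": "555-0199"}
record = merge_phones(raw)
```

The three phone entries in the field definition would then be replaced by a single entry for the phones field. As far as I know, the dictionary-style definitions take a custom comparator as {'field': 'phones', 'type': 'Custom', 'comparator': phonesComparator}, but check that against the documentation for your Dedupe version.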