
Is there a variable type in the Python dedupe library for cross-matching phone numbers?


I'm using the Dedupe library to match person records to each other. My data includes first_name, last_name, email, phone1, phone2, phone3 and address information.

Here is my question: I always want two records to match with 80% to 99% confidence if they have the same first_name and last_name together with matching phone1, phone2, phone3, email and address values. I also want to match phone numbers across fields, for example phone1=phone2, phone1=phone3, phone2=phone3.

Here is an example of some of my code:

fields = [
{'field' : 'first_name','variable name': 'ffname','type': 'Exact'},
{'field' : 'last_name','variable name': 'lname','type': 'Exact'},
{'field' : 'email','variable name': 'email', 'type': 'Exact','Has Missing':True},
{'field' : 'phone1','variable name': 'phone1', 'type': 'Exact', 'Has Missing':True},
{'field' : 'phone2','variable name': 'phone2', 'type': 'Exact', 'Has Missing':True},
{'field' : 'phone3','variable name': 'phone3', 'type': 'Exact', 'Has Missing':True},
{'field' : 'address','variable name': 'addr','type': 'String','Has Missing':True}    
]

In the Dedupe library, is there any way to match phone numbers across these fields in combination with first_name and last_name?


Solution

  • Looking at the documentation, there are two ways of doing that.

    The first is to use the Set variable type. The catch: Set is similar to Text in the way it compares values. It looks at common terms, so from that perspective the phone number (123) 456-7890 is not the same as 4567890.

    The other alternative, which I believe is better, is to build a custom comparator. This comparator takes two lists of phone numbers and returns a number; the lower the number, the closer the match. It can be based on the affine gap distance, which is already used for String variables. Here's an implementation:

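    from affinegap import normalizedAffineGapDistance as affineGap

    def phonesComparator(f1, f2):
        # Compare every phone number in one record with every number in the other.
        distances = [affineGap(p1, p2) for p1 in f1 for p2 in f2]

        if distances:
            # The closest pair of numbers decides the score.
            return min(distances)
        else:
            # Nothing to compare (one of the lists is empty): return a large distance.
            return 200.0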
    

    Here I'm returning the minimum distance between any two phone numbers in the two lists, but one can of course come up with alternative measures.

    One final note: when creating the records, place all of a record's phone numbers in a single field. That field should hold a list of phone numbers (or an empty list if there are none).
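
    As a rough illustration of that last step, here is a minimal sketch of collapsing phone1, phone2 and phone3 into one field before handing the records to Dedupe. The helper names (normalize_phone, collect_phones, to_record) and the field name phones are made up for this example and are not part of the library. Stripping everything except digits is also an assumption: it removes formatting differences, but it will not make a number with an area code equal to one without. The numbers are stored in a tuple rather than a list here so the value is hashable; a list works the same way with the comparator.

```python
import re

PHONE_KEYS = ("phone1", "phone2", "phone3")


def normalize_phone(raw):
    """Keep only the digits, so '(123) 456-7890' and '123.456.7890' become equal."""
    return re.sub(r"\D", "", raw or "")


def collect_phones(row, keys=PHONE_KEYS):
    """Merge the separate phone columns into one sorted tuple without duplicates."""
    phones = {normalize_phone(row.get(key)) for key in keys}
    phones.discard("")  # drop missing or empty numbers
    return tuple(sorted(phones))


def to_record(row):
    """Copy the row, replacing phone1..phone3 with a single 'phones' field."""
    record = {k: v for k, v in row.items() if k not in PHONE_KEYS}
    record["phones"] = collect_phones(row)
    return record


a = to_record({"first_name": "Ann", "last_name": "Lee",
               "phone1": "(123) 456-7890", "phone2": None, "phone3": ""})
b = to_record({"first_name": "Ann", "last_name": "Lee",
               "phone1": "", "phone2": "555-0100", "phone3": "123.456.7890"})

print(a["phones"])
print(b["phones"])
```

    With records shaped like this, the three phone entries in the field definition would be replaced by a single entry along the lines of {'field': 'phones', 'type': 'Custom', 'comparator': phonesComparator}, assuming a Dedupe version that accepts dictionary-style variable definitions as in the question.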