
Alternative approach to using agrep() for fuzzy matching in R


I have a large file of administrative data, about 1 million records. Individual people can be represented multiple times in this dataset. About half the records have an identifying code that maps records to individuals; for the half that don't, I need to fuzzy match names to flag records that potentially belong to the same person.

From looking at the records with the identifying code, I've created a list of differences that have occurred in the recording of names for the same individual:

Given the types of matches I'm after, is there a better approach than agrep()/Levenshtein distance that is easily implemented in R?

Edit: agrep() in R doesn't do a very good job with this problem. Because of the large number of insertions and substitutions I need to allow to account for the ways names are recorded differently, it throws up a lot of false matches.
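To make the false-match problem concrete, here is a minimal sketch using adist() from the utils package, which reports the generalized Levenshtein distance that agrep() works with. The names are made up for illustration and are not taken from the dataset.

```r
# Hypothetical names, used only to illustrate the edit distances involved.
target     <- "Jon Snow"
candidates <- c("Jon Targaryen Snow",  # same person, middle name added
                "Ron Stow")            # different person, similar spelling

# adist() returns a 1 x 2 matrix of edit distances; drop() flattens it.
d <- drop(utils::adist(target, candidates))
names(d) <- candidates
d
```

Here the record with the added middle name is 10 edits away from the target, while the unrelated name is only 2 edits away, so any threshold loose enough to accept the first will also accept the second.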


Solution

  • I would make multiple passes.

    "Jon .* Snow" - Middle name

    "Jon .*Snow" - Second last name

    Nicknames will require a dictionary of mappings from long form to short; there's no regular expression that'll handle this.

    "Snow Jon" - Reversal (duh)

    agrep will handle minor misspellings.

    You probably also want to tokenise your names into first, middle and last names.
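The passes above can be sketched roughly as follows. This is only an illustration under a few assumptions: the records, the nicknames dictionary and the helper names normalise() and match_pass() are all hypothetical, the regular expressions are the ones listed above with ^ and $ anchors added, and the agrepl() threshold of 2 is an arbitrary choice. It also assumes the names contain no regex metacharacters.

```r
# Hypothetical records; only "Jon Snow" and "Snow Jon" come from the answer.
records <- c("Jon Snow", "Jon Targaryen Snow", "Jon Stark-Snow",
             "Snow Jon", "Jonathan Snow", "Jon Snpw", "Arya Stark")

# Hypothetical nickname dictionary: long form -> short form.
nicknames <- c(Jonathan = "Jon", Samwell = "Sam")

# Tokenise a name on whitespace and swap any known long form for its short form.
normalise <- function(name) {
  tokens <- strsplit(name, "\\s+")[[1]]
  known  <- tokens %in% names(nicknames)
  tokens[known] <- nicknames[tokens[known]]
  paste(tokens, collapse = " ")
}

# Try each pass in order and report the first one that matches (NA if none).
match_pass <- function(name, first = "Jon", last = "Snow") {
  target <- paste(first, last)
  if (name == target) return("exact")
  if (grepl(paste0("^", first, " .* ", last, "$"), name)) return("middle name")
  if (grepl(paste0("^", first, " .*", last, "$"), name)) return("second last name")
  if (name == paste(last, first)) return("reversal")
  if (normalise(name) == target) return("nickname")
  # Last resort: allow up to 2 edits for minor misspellings.
  if (agrepl(target, name, max.distance = 2)) return("misspelling")
  NA_character_
}

passes <- vapply(records, match_pass, character(1), USE.NAMES = FALSE)
data.frame(record = records, pass = passes)
```

Running the strict passes first and leaving agrepl() for last keeps the fuzzy threshold small, which is the point of making multiple passes. On a million records you would want to vectorise this, or restrict comparisons to candidates that share a token, instead of looping over one name at a time.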