I have a large file of administrative data, about 1 million records. Individual people can be represented multiple times in this dataset. About half the records have an identifying code that maps records to individuals; for the half that don't, I need to fuzzy match names to flag records that potentially belong to the same person.
From looking at the records with the identifying code, I've created a list of differences that have occurred in the recording of names for the same individual:
Given the types of matches I'm after, is there a better approach than agrep()/Levenshtein distance that is easily implemented in R?
Edit: agrep() in R doesn't do a very good job with this problem. Because of the large number of insertions and substitutions I need to allow to account for the ways names are recorded differently, it throws up a lot of false matches.
I would make multiple passes.
"Jon .* Snow" - Middle name
"Jon .*Snow" - Second last name
Nicknames will require a dictionary of mappings from long form to short; there's no regular expression that'll handle this.
"Snow Jon" - Reversal (duh)
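As a rough sketch of how the regex passes could look in R: the helper make_passes() and the sample names are made up for illustration, and the patterns are the unanchored ones from the list. Real names containing regex metacharacters (".", "(", and so on) would need escaping first.

```r
# Hypothetical sample names, not taken from the real data.
names_vec <- c("Jon Snow", "Jon Aegon Snow", "Jon Stark-Snow",
               "Snow Jon", "Jane Snow")

# Hypothetical helper: build one pattern per pass for a given first/last name.
make_passes <- function(first, last) {
  c(
    middle_name      = paste0(first, " .* ", last),  # "Jon .* Snow"
    second_last_name = paste0(first, " .*", last),   # "Jon .*Snow"
    reversal         = paste0(last, " ", first)      # "Snow Jon"
  )
}

passes <- make_passes("Jon", "Snow")

# One logical column per pass, one row per candidate name.
hits <- sapply(passes, function(p) grepl(p, names_vec))
rownames(hits) <- names_vec

# Flag every record caught by at least one pass.
flagged <- names_vec[rowSums(hits) > 0]
print(hits)
print(flagged)
```

Keeping the passes as separate columns means you can see which rule flagged a record, which helps when reviewing potential matches by hand.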
agrep will handle minor misspellings.
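For example, something along these lines, with a small max.distance so that only near-identical spellings are caught. The names and the threshold of 2 are assumptions for illustration; you would want to tune the threshold against the records that do have the identifying code.

```r
# Hypothetical reference name and candidate records.
known      <- "Jon Snow"
candidates <- c("Jon Snow", "John Snow", "Jon Snwo", "Jane Doe")

# Allow at most 2 edits in total (insertions, deletions, substitutions).
close_matches <- agrep(known, candidates,
                       max.distance = 2,
                       ignore.case  = TRUE,
                       value        = TRUE)
print(close_matches)

# adist() gives the edit distance for each candidate, which is useful
# for checking why a particular record was or wasn't flagged.
distances <- drop(adist(known, candidates))
names(distances) <- candidates
print(distances)
```

Note that agrep() matches the pattern approximately against any part of the candidate string, so it is more permissive than comparing whole-string distances from adist().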
You probably also want to tokenise your names into first, middle and last names.
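A minimal sketch of the tokenising step, combined with the nickname dictionary mentioned earlier. It assumes names are whitespace-separated with the first name first and the last name last; tokenise_name(), canonical_first() and the nickname table are all hypothetical and would need to be adapted to your data.

```r
# Split a full name into first, middle and last parts.
# Assumes "First [Middle ...] Last" order, separated by whitespace.
tokenise_name <- function(x) {
  parts <- strsplit(trimws(x), "\\s+")[[1]]
  n <- length(parts)
  list(
    first  = parts[1],
    middle = if (n > 2) paste(parts[2:(n - 1)], collapse = " ") else NA_character_,
    last   = if (n > 1) parts[n] else NA_character_
  )
}

# Hypothetical nickname dictionary: long form -> short form.
nicknames <- c(Jonathan = "Jon", William = "Bill", Elizabeth = "Liz")

# Invert it so a short form can be looked up to get the long form.
to_long <- setNames(names(nicknames), nicknames)

# Map a first name to its long form if it is a known nickname.
canonical_first <- function(first) {
  if (first %in% names(to_long)) to_long[[first]] else first
}

tok <- tokenise_name("Jon Aegon Snow")
print(tok)
print(canonical_first(tok$first))
```

Once names are tokenised, you can compare last names and canonical first names separately, which should need a much smaller edit distance than matching whole strings.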