t-sql pattern-matching data-scrubbing

Check for typos comparing two strings in T-SQL


We have developed a series of business rules that determine whether a contact record is a duplicate. The rules first check for the same name and then compare other fields such as phone number, email, etc.

The problem is only a small percentage of records are being captured and automatically scrubbed/merged.

To capture more records I would like to check for typos in the contact's name (e.g. Michael=Micheal).

Is there a good function I can use to check for typos, in order to return more accurate results? I would think a function that looks for a single character difference comparing two strings would do the trick.


Solution

  • Keep in mind that most string similarity measurement algorithms are computationally intensive and, depending on the volume of the job at hand, T-SQL might be a poor choice, performance-wise.

    In lieu of a string similarity measurement per se, consider hash functions, in particular ones that preserve the main "structure" of the words. The advantage of hash codes is that they are computed just once, using only one string as input, and can then be used in T-SQL filters with a plain equality predicate (unlike similarity measurements, which require running the algorithm against each possible reference string). A plausible hash code suggestion is SOUNDEX, which happens to be particularly well suited to typical variations in person and company names and which is also implemented "natively" as a T-SQL function.
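
    As a rough illustration of the equality-predicate idea, the sketch below pairs up contacts whose first and last names share a SOUNDEX code. The table variable `@Contact` and its columns are hypothetical stand-ins for the real contact table, and the sample rows are made up for the example.

    ```sql
    -- Hypothetical contact data; replace with the real table and columns.
    DECLARE @Contact TABLE (
        ContactId INT PRIMARY KEY,
        FirstName NVARCHAR(50) NOT NULL,
        LastName  NVARCHAR(50) NOT NULL
    );

    INSERT INTO @Contact (ContactId, FirstName, LastName)
    VALUES (1, N'Michael', N'Smith'),
           (2, N'Micheal', N'Smith'),
           (3, N'Robert',  N'Jones');

    -- Candidate duplicates: different rows whose name parts hash to the same code.
    -- a.ContactId < b.ContactId avoids self-matches and mirrored pairs.
    SELECT a.ContactId AS ContactIdA,
           b.ContactId AS ContactIdB,
           a.FirstName AS FirstNameA,
           b.FirstName AS FirstNameB
    FROM @Contact AS a
    JOIN @Contact AS b
      ON a.ContactId < b.ContactId
     AND SOUNDEX(a.FirstName) = SOUNDEX(b.FirstName)
     AND SOUNDEX(a.LastName)  = SOUNDEX(b.LastName);
    ```

    With this sample data the query should pair contacts 1 and 2, since "Michael" and "Micheal" produce the same code. On a large table it would probably be better to store the codes in their own indexed columns rather than call SOUNDEX inside the join.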

    It would probably be preferable to compute the SOUNDEX code for each individual word in the name field, for example producing two codes for an input like "Charles Darwin", three for "Jean Jacques Rousseau", etc. For better performance, you may also need a way of telling the surname apart from the given name, so as to simplify your filter condition.
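
    One possible way to get a code per word, assuming SQL Server 2016 or later (where `STRING_SPLIT` is available) and a hypothetical `FullName` column holding space-separated name parts:

    ```sql
    -- Hypothetical contact data with the whole name in a single column.
    DECLARE @Contact TABLE (
        ContactId INT PRIMARY KEY,
        FullName  NVARCHAR(100) NOT NULL
    );

    INSERT INTO @Contact (ContactId, FullName)
    VALUES (1, N'Charles Darwin'),
           (2, N'Jean Jacques Rousseau');

    -- One output row per word: two rows for contact 1, three for contact 2.
    SELECT c.ContactId,
           s.value          AS NamePart,
           SOUNDEX(s.value) AS NamePartSoundex
    FROM @Contact AS c
    CROSS APPLY STRING_SPLIT(c.FullName, N' ') AS s
    WHERE s.value <> N'';   -- skip empty parts caused by repeated spaces
    ```

    Note that `STRING_SPLIT` in this form does not report the position of each part, so this sketch alone cannot tell the surname from the given name; that distinction would have to come from separate columns or additional logic.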

    If you prefer working with string similarity methods, I found that either the Levenshtein distance or the Ratcliff/Obershelp measure works rather well for dealing with small variations such as typos. As with SOUNDEX, you may still consider handling words separately. This introduces the difficulty of dealing with multiple values for a given name entry, but it also allows more active handling of the typical situation with names, where some instances are ordered first name then last name and others are in the reverse order (or where parts of the name are omitted or abbreviated).
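
    For reference, here is a minimal and deliberately unoptimized sketch of the Levenshtein distance as a scalar function. The name `dbo.LevenshteinDistance` is hypothetical (SQL Server has no such built-in that I would rely on here), the 100-character limit is an arbitrary assumption, and characters are compared using the database collation, so the comparison is typically case-insensitive.

    ```sql
    CREATE FUNCTION dbo.LevenshteinDistance (@s NVARCHAR(100), @t NVARCHAR(100))
    RETURNS INT
    AS
    BEGIN
        IF @s IS NULL OR @t IS NULL RETURN NULL;

        -- LEN ignores trailing spaces, which is usually acceptable for names.
        DECLARE @sLen INT = LEN(@s), @tLen INT = LEN(@t);
        IF @sLen = 0 RETURN @tLen;
        IF @tLen = 0 RETURN @sLen;

        -- cost at (i, j) = distance between the first i characters of @s
        -- and the first j characters of @t.
        DECLARE @d TABLE (i INT NOT NULL, j INT NOT NULL, cost INT NOT NULL,
                          PRIMARY KEY (i, j));
        DECLARE @i INT = 0, @j INT, @cost INT;
        DECLARE @up INT, @left INT, @diag INT, @sub INT;

        WHILE @i <= @sLen
        BEGIN
            SET @j = 0;
            WHILE @j <= @tLen
            BEGIN
                IF @i = 0 SET @cost = @j;            -- j insertions
                ELSE IF @j = 0 SET @cost = @i;       -- i deletions
                ELSE
                BEGIN
                    SELECT @up   = cost FROM @d WHERE i = @i - 1 AND j = @j;
                    SELECT @left = cost FROM @d WHERE i = @i     AND j = @j - 1;
                    SELECT @diag = cost FROM @d WHERE i = @i - 1 AND j = @j - 1;
                    SET @sub = CASE WHEN SUBSTRING(@s, @i, 1) = SUBSTRING(@t, @j, 1)
                                    THEN 0 ELSE 1 END;

                    -- Minimum of substitution, deletion and insertion.
                    SET @cost = @diag + @sub;
                    IF @up + 1 < @cost SET @cost = @up + 1;
                    IF @left + 1 < @cost SET @cost = @left + 1;
                END;

                INSERT INTO @d (i, j, cost) VALUES (@i, @j, @cost);
                SET @j += 1;
            END;
            SET @i += 1;
        END;

        RETURN (SELECT cost FROM @d WHERE i = @sLen AND j = @tLen);
    END;
    GO  -- batch separator for SSMS/sqlcmd, not a T-SQL statement

    SELECT dbo.LevenshteinDistance(N'Michael', N'Micheal') AS Distance;
    ```

    Plain Levenshtein counts a swap of two adjacent letters as two edits, so "Michael" versus "Micheal" comes out as 2 rather than 1; a filter that looks for a single character difference would miss it, and a threshold of 2 (or a variant that treats transpositions as one edit) may be closer to what is wanted. Because the function is called once per pair of candidates, it is probably best applied only after a cheaper filter has narrowed the candidates down.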