mysqlstringmatchingfuzzy-search

How do I do a fuzzy match of company names in MYSQL with PHP for auto-complete?


My users will import through cut and paste a large string that will contain company names.

I have an existing and growing MYSQL database of companies names, each with a unique company_id.

I want to be able to parse through the string and assign to each of the user-inputed company names a fuzzy match.

Right now, just doing a straight-up string match, is also slow. ** Will Soundex indexing be faster? How can I give the user some options as they are typing? **

For example, someone writes:

Microsoft       -> Microsoft
Bare Essentials -> Bare Escentuals
Polycom, Inc.   -> Polycom

I have found the following threads that seem similar to this question, but the poster has not approved and I'm not sure if their use-case is applicable:

How to find best fuzzy match for a string in a large string database

Matching inexact company names in Java


Solution

  • You can start with using SOUNDEX(), this will probably do for what you need (I picture an auto-suggestion box of already-existing alternatives for what the user is typing).

    The drawbacks of SOUNDEX() are:

    Example:

    SELECT SOUNDEX('Microsoft')
    SELECT SOUNDEX('Microsift')
    SELECT SOUNDEX('Microsift Corporation')
    SELECT SOUNDEX('Microsift Subsidary')
    
    /* all of these return 'M262' */
    

    For more advanced needs, I think you need to look at the Levenshtein distance (also called "edit distance") of two strings and work with a threshold. This is the more complex (=slower) solution, but it allows for greater flexibility.

    Main drawback is, that you need both strings to calculate the distance between them. With SOUNDEX you can store a pre-calculated SOUNDEX in your table and compare/sort/group/filter on that. With the Levenshtein distance, you might find that the difference between "Microsoft" and "Nzcrosoft" is only 2, but it will take a lot more time to come to that result.

    In any case, an example Levenshtein distance function for MySQL can be found at codejanitor.com: Levenshtein Distance as a MySQL Stored Function (Feb. 10th, 2007).