mysqlfuzzy-searchsoundex

Improve finding fuzzy duplicates using MySQL


I have a table of names or companies or products that attracted duplicate records due to operator entry error.

I'm attempting to create a tool to manage this problem. It won't be a high-traffic page, but it still shouldn't kill the database when building the recordset. I have this query, which takes a couple of minutes to process (way too long):

SELECT 
    tab1.id as id1, 
    tab1.creative as creative1, 
    tab2.id as id2, 
    tab2.creative as creative2 
FROM 
    creatives tab1, 
    creatives tab2 
WHERE 
    SOUNDEX(tab1.creative)= SOUNDEX(tab2.creative) 
AND 
    tab1.id<>tab2.id 
AND 
    tab1.id=(
        SELECT 
            MAX(id) 
        FROM 
            creatives tab 
        WHERE 
            SOUNDEX(tab.creative)=SOUNDEX(tab1.creative))

Now apart from taking way too long to return a result, the results are sometimes a bit too fuzzy. For instance, it's great that it finds these:

Convenery of Trades of Edinburgh | Convenery of Trades of Edinbubrgh
Crowdedlogic Theatre Company | Crowded Logic Theatre Company

But these seem way off:

Daniel Cope | Dan Willis & Obie
David Williams | David Holmes

Is there a faster and less fuzzy way of doing this?


Solution

  • There are two parts to your question.

    First, why is your query slow?

    Second, why does SOUNDEX() have too many false positive matches? Is there a better way to come up with near matches than SOUNDEX()?

    Let's take 'em up one at a time.

    First of all, let's try to speed up this query. Let's start by recasting it in standard SQL (eliminating the old-style JOIN).

    SELECT 
        tab1.id as id1, 
        tab1.creative as creative1, 
        tab2.id as id2, 
        tab2.creative as creative2 
    FROM creatives AS tab1 
    JOIN creatives tab2 
      ON (
              tab1.id < tab2.id   /* don't duplicate pairs a/b b/a */  
          AND SOUNDEX(tab1.creative)= SOUNDEX(tab2.creative) 
         )
    

    For now let's leave out the last clause of your query.

    As you can see, this is going to evaluate the SOUNDEX function on the order of (n squared) times if you have n rows in your table.

    I suggest you put in a new column to the table. Make it a text string. Call it compare_hash.

    Then populate it like this:

    UPDATE creatives 
       SET compare_hash = SOUNDEX(creative)
    

    Then index it.

    Then run this query:

        SELECT 
        tab1.id as id1, 
        tab1.creative as creative1, 
        tab2.id as id2, 
        tab2.creative as creative2 
    FROM creatives AS tab1 
    JOIN creatives tab2 
      ON (
              tab1.id < tab2.id   /* don't duplicate pairs a/b b/a */  
          AND tab1.compare_hash = tab2.compare_hash
         )
    

    That should be a ton faster because it can use an index.

    On to your second problem. Look, here's the deal: SOUNDEX() is designed to get lots of false positives. It's also designed for American English names. It's an old-timey Bell System telephone company information operator function designed to show multiple names when people ask for Bessie Schmidt when they want Bessie Smith.

    You need to cook up some other comparison-hash functions and experiment with them. The thing that's cool about your extra table column is that you can do that like this. This example converts all your strings to lower case, then pulls out the spaces, then each of the vowels. So, the hash for "David Williams" will be "dvdwllms" which is different from "dvdhlms"

    UPDATE creatives  SET compare_hash = LOWER(creative);
    UPDATE creatives  SET compare_hash = REPLACE(compare_hash , ' ', '');
    UPDATE creatives  SET compare_hash = REPLACE(compare_hash , 'a', '');
    UPDATE creatives  SET compare_hash = REPLACE(compare_hash , 'e', '');
    UPDATE creatives  SET compare_hash = REPLACE(compare_hash , 'i', '');
    UPDATE creatives  SET compare_hash = REPLACE(compare_hash , 'o', '');
    UPDATE creatives  SET compare_hash = REPLACE(compare_hash , 'u', '');
    

    Once you've made this compare_hash, you can run the same self join query.

    (I've tried Levenshein distance for this kind of thing. The problem there is getting a consistent sameness metric for pairs of strings of different lengths.)

    It's going to take some messing around and some elbow grease to get this done. How much programming you put into programming it depends, well, on http://xkcd.com/1205/