I have a table of names or companies or products that attracted duplicate records due to operator entry error.
I'm attempting to create a tool to manage this problem. It won't be a high-traffic page, but it still shouldn't kill the database when building the recordset. I have this query, which takes a couple of minutes to process (way too long):
SELECT
tab1.id as id1,
tab1.creative as creative1,
tab2.id as id2,
tab2.creative as creative2
FROM
creatives tab1,
creatives tab2
WHERE
SOUNDEX(tab1.creative)= SOUNDEX(tab2.creative)
AND
tab1.id<>tab2.id
AND
tab1.id=(
SELECT
MAX(id)
FROM
creatives tab
WHERE
SOUNDEX(tab.creative)=SOUNDEX(tab1.creative))
Now apart from taking way too long to return a result, the results are sometimes a bit too fuzzy. For instance, it's great that it finds these:
Convenery of Trades of Edinburgh | Convenery of Trades of Edinbubrgh
Crowdedlogic Theatre Company | Crowded Logic Theatre Company
But these seem way off:
Daniel Cope | Dan Willis & Obie
David Williams | David Holmes
Is there a faster and less fuzzy way of doing this?
There are two parts to your question.
First, why is your query slow?
Second, why does SOUNDEX()
have too many false positive matches? Is there a better way to come up with near matches than SOUNDEX()
?
Let's take 'em up one at a time.
First of all, let's try to speed up this query. Let's start by recasting it in standard SQL (eliminating the old-style JOIN).
SELECT
tab1.id as id1,
tab1.creative as creative1,
tab2.id as id2,
tab2.creative as creative2
FROM creatives AS tab1
JOIN creatives tab2
ON (
tab1.id < tab2.id /* don't duplicate pairs a/b b/a */
AND SOUNDEX(tab1.creative)= SOUNDEX(tab2.creative)
)
For now let's leave out the last clause of your query.
As you can see, this is going to evaluate the SOUNDEX function on the order of (n squared) times if you have n rows in your table.
I suggest you put in a new column to the table. Make it a text string. Call it compare_hash.
Then populate it like this:
UPDATE creatives
SET compare_hash = SOUNDEX(creative)
Then index it.
Then run this query:
SELECT
tab1.id as id1,
tab1.creative as creative1,
tab2.id as id2,
tab2.creative as creative2
FROM creatives AS tab1
JOIN creatives tab2
ON (
tab1.id < tab2.id /* don't duplicate pairs a/b b/a */
AND tab1.compare_hash = tab2.compare_hash
)
That should be a ton faster because it can use an index.
On to your second problem. Look, here's the deal: SOUNDEX() is designed to get lots of false positives. It's also designed for American English names. It's an old-timey Bell System telephone company information operator function designed to show multiple names when people ask for Bessie Schmidt when they want Bessie Smith.
You need to cook up some other comparison-hash functions and experiment with them. The thing that's cool about your extra table column is that you can do that like this. This example converts all your strings to lower case, then pulls out the spaces, then each of the vowels. So, the hash for "David Williams" will be "dvdwllms" which is different from "dvdhlms"
UPDATE creatives SET compare_hash = LOWER(creative);
UPDATE creatives SET compare_hash = REPLACE(compare_hash , ' ', '');
UPDATE creatives SET compare_hash = REPLACE(compare_hash , 'a', '');
UPDATE creatives SET compare_hash = REPLACE(compare_hash , 'e', '');
UPDATE creatives SET compare_hash = REPLACE(compare_hash , 'i', '');
UPDATE creatives SET compare_hash = REPLACE(compare_hash , 'o', '');
UPDATE creatives SET compare_hash = REPLACE(compare_hash , 'u', '');
Once you've made this compare_hash, you can run the same self join query.
(I've tried Levenshein distance for this kind of thing. The problem there is getting a consistent sameness metric for pairs of strings of different lengths.)
It's going to take some messing around and some elbow grease to get this done. How much programming you put into programming it depends, well, on http://xkcd.com/1205/