I'm trying to join a set of county names from one table with county names in another table. The issue here is that, the county names in both tables are not normalized. They are not same in count; also, they may not be appearing in similar pattern always. For instance, the county 'SAINT JOHNS' in "Table A" may be represented as 'ST JOHNS' in "Table B". We cannot predict a common pattern for them.
That means , we cannot use "equal to" (=
) condition while joining. So, I'm trying to join them using the JARO_WINKLER_SIMILARITY
function in oracle.
My Left Outer Join condition would be like:
Table_A.State = Table_B.State
AND UTL_MATCH.JARO_WINKLER_SIMILARITY(Table_A.County_Name,Table_B.County_Name)>=80
I've given the measure 80 after some testing of the results and it seemed to be optimal.
Here, the issue is that I'm getting set of "false Positives" when joining. For instance, if there are some counties with similarity in names under the same state ("BARRY'and "BAY" for example), they will be matched if the measure is >=80
.
This creates inaccurate set of joined data.
Can anyone please suggest some work around?
Thanks, DAV
Can you plz help me to build a query that will lookup Table_A for each record in Table B/C/D, and match against the county name in A with highest ranked similarity that is >=80
Oracle Setup:
CREATE TABLE official_words ( word ) AS
SELECT 'SAINT JOHNS' FROM DUAL UNION ALL
SELECT 'MONTGOMERY' FROM DUAL UNION ALL
SELECT 'MONROE' FROM DUAL UNION ALL
SELECT 'SAINT JAMES' FROM DUAL UNION ALL
SELECT 'BOTANY BAY' FROM DUAL;
CREATE TABLE words_to_match ( word ) AS
SELECT 'SAINT JOHN' FROM DUAL UNION ALL
SELECT 'ST JAMES' FROM DUAL UNION ALL
SELECT 'MONTGOMERY BAY' FROM DUAL UNION ALL
SELECT 'MONROE ST' FROM DUAL;
Query:
SELECT *
FROM (
SELECT wtm.word,
ow.word AS official_word,
UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word ) AS similarity,
ROW_NUMBER() OVER ( PARTITION BY wtm.word ORDER BY UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word ) DESC ) AS rn
FROM words_to_match wtm
INNER JOIN
official_words ow
ON ( UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word )>=80 )
)
WHERE rn = 1;
Output:
WORD OFFICIAL_WO SIMILARITY RN
-------------- ----------- ---------- ----------
MONROE ST MONROE 93 1
MONTGOMERY BAY MONTGOMERY 94 1
SAINT JOHN SAINT JOHNS 98 1
ST JAMES SAINT JAMES 80 1