sqloracle

UTL_MATCH Performance


I've got 2 tables 1 with millions of records and the other with 200k. I'm attempting to do a UTL match with a couple of concatenated columns but of course this process would take a significant amount of time (comparing each record millions of times). The x.code is just another requirement for the comparison of table1 Is there any way that this could be sped up? It took 18min to compare 100 rows

select * 

from table1 x
where exists(select * 
             from largetable y 
             where x.state = y.state and 
                   utl_match.jaro_winkler_similarity(x.address || ' ' || x.city, y.address || ' ' || y.city) > 95 and 
                   x.code like '%a%'
            );

Solution

  • You are doing a distance match. By that logic there must be some similarity between the fields that can be indexed. Find some function which produces an indexable result, and use that for the search space for each row.

    For example, you have a City field - most normalized city names are going to be an exact match (and you could normalize distinct values of city to a separate table to fix spelling errors). Once you have a city match, you have likely reduced your search space for address by a fair bit.

    If that is not enough, produce three "match" keys of first two alpha characters of first word of address, last two alpha characters of first word of address, numbers present in address. Use the three "match" keys as a winnowing function prior to considering edit distance.

    For example, in matching "123 Mulberry lane" to "123 Mlberry ln", the steps would be:

    123 Mulberry Lane -> { numbers: '123', first: 'MU', last: 'RY' }
    123 Mlberry Ln    -> { numbers: '123', first: 'ML', last: 'RY' }
    

    So, while the first character winnow would cause the address to not be considered - the numbers and last heuristic still catch it.