sqlprestoset-intersection

Is this an efficient way to find overlap of two string columns


I have a table with columns name and full name, and am interested in knowing whether one is part of the other (but they are not equal).
Example:

name        |  full name      |  is_subset
-------------------------------------------
john smith  |  john smith     |  false
john smith  |  john h. smith  |  true
john smith  |  alice jones    |  false

I found a method that works using SPLIT and array intersections, but was curious as to whether there is a simpler or more efficient method.

SELECT 
name,
full name,
CARDINALITY(
    (
        ARRAY_INTERSECT(
            SPLIT(name, ' '),
            SPLIT(full name, ' ')
        )
    )
) = CARDINALITY(SPLIT(name, ' ')) AND CARDINALITY(
    SPLIT(full name, ' ')
) > CARDINALITY(SPLIT(name, ' ')) AS is_subset
from t

Solution

  • -- name is a logical subset of fullname
    ARRAY_EXCEPT(
       SPLIT(name, ' '),
       SPLIT(fullname, ' ')
    )
    =
    ARRAY []
    
    AND
    
    -- name and fullname are unequal sets
    ARRAY_EXCEPT(
       SPLIT(fullname, ' '),
       SPLIT(name, ' ')
    )
    !=
    ARRAY []
    

    The second condition can probably be reduced to...

    AND
    
    LENGTH(name) < LENGTH(fullname)
    

    (The shorter LENGTH() version can be "tricked" by certain uses of duplicates.)