mysqlsqlsplitn-gram

SQL compose bi-gram and search if exists in other table


In SQL, having a table T1 contains

TITLE
age 5 alton john live
show must go on

Having a table T2 contains

NAME. DESCRIPTION
John Bo altonjohn for kids
Alton show age5 mustgo kids

I would like to finding bigrams (pairs of consecutive words) in TITLE (T1) and check if at list 1 bigram exists in DESCRIPTION (T2) and return TITLE, DESCRIPTION & the BI-GRAM

Expected Output:

TITLE DESCRIPTION . BIGRAM
age 5 alton john live. altonjohn for kids. . altonjohn
age 5 alton john live. show age5 mustgo kids . age5
show must go on show age5 mustgo kids . mustgo

Solution

  • A slight variation of the previous query should do this easily:

    WITH RECURSIVE cte AS (
        SELECT TITLE, 
               LENGTH(TITLE)-LENGTH(REPLACE(TITLE,' ','')) AS num_bigrams,
               SUBSTRING_INDEX(
                   SUBSTRING_INDEX(TITLE, ' ', 
                                   LENGTH(TITLE)-LENGTH(REPLACE(TITLE,' ',''))+1
                   ), ' ', -2) AS bigram
        FROM t1
        UNION ALL
        SELECT TITLE,
               num_bigrams - 1 AS num_bigrams,
               SUBSTRING_INDEX(SUBSTRING_INDEX(TITLE, ' ', num_bigrams), ' ', -2)
        FROM cte
        WHERE num_bigrams > 1
    )
    SELECT TITLE, DESCRIPTION, bigram
    FROM       cte
    INNER JOIN t2
            ON t2.DESCRIPTION REGEXP CONCAT('( |^)', cte.bigram, '( |$)')
    

    Differences are:

    Check the demo here.

    Note: if you want to remove the middle space from the bigram, you just need to add a REPLACE function that removes that extra space.