In SQL, having a table T1 contains
TITLE |
---|
age 5 alton john live |
show must go on |
Having a table T2 contains
NAME. | DESCRIPTION |
---|---|
John Bo | altonjohn for kids |
Alton | show age5 mustgo kids |
I would like to finding bigrams (pairs of consecutive words) in TITLE (T1) and check if at list 1 bigram exists in DESCRIPTION (T2) and return TITLE, DESCRIPTION & the BI-GRAM
Expected Output:
TITLE | DESCRIPTION | . BIGRAM |
---|---|---|
age 5 alton john live. | altonjohn for kids. | . altonjohn |
age 5 alton john live. | show age5 mustgo kids | . age5 |
show must go on | show age5 mustgo kids | . mustgo |
A slight variation of the previous query should do this easily:
WITH RECURSIVE cte AS (
SELECT TITLE,
LENGTH(TITLE)-LENGTH(REPLACE(TITLE,' ','')) AS num_bigrams,
SUBSTRING_INDEX(
SUBSTRING_INDEX(TITLE, ' ',
LENGTH(TITLE)-LENGTH(REPLACE(TITLE,' ',''))+1
), ' ', -2) AS bigram
FROM t1
UNION ALL
SELECT TITLE,
num_bigrams - 1 AS num_bigrams,
SUBSTRING_INDEX(SUBSTRING_INDEX(TITLE, ' ', num_bigrams), ' ', -2)
FROM cte
WHERE num_bigrams > 1
)
SELECT TITLE, DESCRIPTION, bigram
FROM cte
INNER JOIN t2
ON t2.DESCRIPTION REGEXP CONCAT('( |^)', cte.bigram, '( |$)')
Differences are:
SUBSTRING_INDEX
function, to recover the last two words instead of the last one [you can generalize on this for trigrams and others too], both for the base and the recursive step of the recursion.WHERE num_bigrams > 1
instead of WHERE num_bigrams > 0
.Check the demo here.
Note: if you want to remove the middle space from the bigram, you just need to add a REPLACE function that removes that extra space.