mysqlregexregexp-replace

How to use REGEXP_REPLACE to convert a verbal BOOLEAN collection of words to MATCH AGAINST format? [MySQL 8.0.36]


I need to convert a text query from standard "verbal" boolean to the FULLLTEXT MATCH AGAINST + and - system.

For example; this value:

((word1 AND word2) OR (word3 AND word4 AND word5)) OR word6 NOT word7

Needs to be converted to:

((+word1 +word2) (+word3 +word4 +word5) word6 -word7

The issue I'm having is when I try to use REGEXP_REPLACE with a pattern like this:

\\b([[:alnum:]_]+)\\b

I can't figure out how to have the function use the word that matched to append/amend to it. I've seen other examples trying to use \1 or something like that to represent the matching value, but it simply doesn't work.

So how can I command REGEXP_REPLACE to dynamically scan a string for "(word +word" and replace it with "(+word +word" ? Also needs to do a leading simple one like:

It's also a bit more complex, because if we have:

(word1 AND word2) AND (word3 OR word4)

That would need to be:

+(+word1 +word2) +(word3 word4)

Any guidance on the best way to do this without writing an ultra slow parser would be appreciated.


Solution

  • I figured out how to convert a AND/OR/NOT boolean statement like the below to a compatible MATCH AGAINST solution, and hope this helps other people.

    This converts:

    (word1 OR word2) OR (word3 AND word4 AND word5) OR word6 NOT word7
    

    TO

    (WORD1 WORD2) (+WORD3 +WORD4 +WORD5) WORD6 -WORD7
    

    CODE:

    SET in_text_search = '(word1 OR word2) OR (word3 AND word4 AND word5) OR word6 NOT word7';
    SET text_reformatted = UCASE(TRIM(in_text_search));
    SET text_reformatted = REPLACE(text_reformatted,'  ',' '); -- Strip extra spaces
    
    -- Make sure the count of parentheses (if exist) is equal.
    IF (INSTR(text_reformatted,'(') > 0 OR INSTR(text_reformatted,')') > 0) THEN
            SET count_character = LENGTH(text_reformatted) - LENGTH(REPLACE(text_reformatted,'(',''));
            IF (count_character <> (LENGTH(text_reformatted) - LENGTH(REPLACE(text_reformatted,')','')))) THEN
                    -- TRIGGER ERROR 'imbalanced parentheses';
                    LEAVE SP;
            END IF;
    END IF;
    
    SET text_reformatted = REPLACE(text_reformatted,' AND ',' +');
    SET text_reformatted = REPLACE(text_reformatted,' OR ',' ');
    SET text_reformatted = REPLACE(text_reformatted,' NOT ',' -');
    
    -- Primary replace
    SET text_reformatted = REGEXP_REPLACE(text_reformatted, '(?<![-+])\\b(\\w+)\\b(?=(?:\\s*\\+|\\s*$))', '+$1', 1, 0, 'c');
    
    -- Cleanup the cart before the horse, for words like .NET etc.
    SET text_reformatted = REPLACE(text_reformatted,'.+','+.');
    SET text_reformatted = REPLACE(text_reformatted,'.-','-.');