I need to convert a text query from standard "verbal" boolean to the FULLLTEXT MATCH AGAINST + and - system.
For example; this value:
((word1 AND word2) OR (word3 AND word4 AND word5)) OR word6 NOT word7
Needs to be converted to:
((+word1 +word2) (+word3 +word4 +word5) word6 -word7
The issue I'm having is when I try to use REGEXP_REPLACE with a pattern like this:
\\b([[:alnum:]_]+)\\b
I can't figure out how to have the function use the word that matched to append/amend to it. I've seen other examples trying to use \1 or something like that to represent the matching value, but it simply doesn't work.
So how can I command REGEXP_REPLACE to dynamically scan a string for "(word +word" and replace it with "(+word +word" ? Also needs to do a leading simple one like:
It's also a bit more complex, because if we have:
(word1 AND word2) AND (word3 OR word4)
That would need to be:
+(+word1 +word2) +(word3 word4)
Any guidance on the best way to do this without writing an ultra slow parser would be appreciated.
I figured out how to convert a AND/OR/NOT boolean statement like the below to a compatible MATCH AGAINST solution, and hope this helps other people.
This converts:
(word1 OR word2) OR (word3 AND word4 AND word5) OR word6 NOT word7
TO
(WORD1 WORD2) (+WORD3 +WORD4 +WORD5) WORD6 -WORD7
CODE:
SET in_text_search = '(word1 OR word2) OR (word3 AND word4 AND word5) OR word6 NOT word7';
SET text_reformatted = UCASE(TRIM(in_text_search));
SET text_reformatted = REPLACE(text_reformatted,' ',' '); -- Strip extra spaces
-- Make sure the count of parentheses (if exist) is equal.
IF (INSTR(text_reformatted,'(') > 0 OR INSTR(text_reformatted,')') > 0) THEN
SET count_character = LENGTH(text_reformatted) - LENGTH(REPLACE(text_reformatted,'(',''));
IF (count_character <> (LENGTH(text_reformatted) - LENGTH(REPLACE(text_reformatted,')','')))) THEN
-- TRIGGER ERROR 'imbalanced parentheses';
LEAVE SP;
END IF;
END IF;
SET text_reformatted = REPLACE(text_reformatted,' AND ',' +');
SET text_reformatted = REPLACE(text_reformatted,' OR ',' ');
SET text_reformatted = REPLACE(text_reformatted,' NOT ',' -');
-- Primary replace
SET text_reformatted = REGEXP_REPLACE(text_reformatted, '(?<![-+])\\b(\\w+)\\b(?=(?:\\s*\\+|\\s*$))', '+$1', 1, 0, 'c');
-- Cleanup the cart before the horse, for words like .NET etc.
SET text_reformatted = REPLACE(text_reformatted,'.+','+.');
SET text_reformatted = REPLACE(text_reformatted,'.-','-.');