php mysql regex word-boundaries

Matching whole words while ignoring affixes using regex


I am learning a new language and I have created a DB with approx. 2500 words and 2500 examples of the words. I created a PHP/MySQL web UI which basically shows a picture for each word, and when you click a picture it plays the audio of the word. There is also a context menu that triggers a pop-up div which matches and displays all examples where the word occurs.

I have been using REGEXP '[[:<:]]$word[[:>:]]', but there are several prefixes/suffixes that I want to filter out because they do not add any real meaning to the word (like the suffix -ing in English). One way I have gotten around this is to put a hyphen in the word where the affix starts, so the regex still matches the word, but this isn't true to how the language handles spelling. There are also combinations of words that I do not want to filter out because the meaning is completely different. Without getting into specifics, here are some pseudo-examples, with the matched word written as just "WORD", the prefixes and suffixes that I want to filter out as pre1, pre2... and suf1, suf2..., and the stuff I do not want to filter out as xxx:

1. Xxx xxx WORDsuf1 xxx xxx xxx.
2. Xxx xxx WORDsuf2 xxx xxx xxx.
3. Xxx xxx pre1WORDsuf1 xxx xxx xxx.
4. Xxx xxx WORD xxx xxx xxx.
5. Xxx xxx pre1WORD xxx xxx xxx.
6. Xxx xxx pre2WORDxxx xxx xxx xxx.
7. Xxx xxx xxxWORDxxx xxx xxx xxx.
8. Xxx xxx pre1WORDxxxsuf1 xxx xxx xxx.
9. Xxx xxx pre1xxxWORDsuf1 xxx xxx xxx.
10. Xxx xxx xxxWORDxxx xxx xxx xxx.

In the examples above I want to match 1, 2, 3, 4 and 5, but I do not want to match 6, 7, 8, 9 or 10. I started to just add OR clauses, for example:

REGEXP '[[:<:]]$word[[:>:]]|[[:<:]]$word$suffix[[:>:]]'

This works fine for one exception, but with multiple exceptions it gets messy.

Admittedly I'm pretty inexperienced with regex and most of what I manage to work out are simple examples that I have to read up on. Can this be done with a short and efficient regex?


Solution

  • Is this what you are looking for?

    (\b(pre1|pre2)?WORD(suf1|suf2)?\b)
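
    To check the pattern against the ten pseudo-examples from the question, here is a minimal PHP sketch. The helper name buildAffixPattern and the affix lists are made up for illustration, and it uses non-capturing groups (?:...) instead of the capturing ones above, which does not change what matches. It assumes PHP 7.4+ for the arrow functions.

    ```php
    <?php
    // Hypothetical helper: builds the regex from a word plus the affixes to ignore.
    function buildAffixPattern(string $word, array $prefixes, array $suffixes): string
    {
        // preg_quote() escapes regex metacharacters; '/' is the delimiter used below.
        $quote = fn(string $s): string => preg_quote($s, '/');
        $pre = implode('|', array_map($quote, $prefixes));
        $suf = implode('|', array_map($quote, $suffixes));

        // The /u modifier treats pattern and subject as UTF-8.
        return '/\b(?:' . $pre . ')?' . $quote($word) . '(?:' . $suf . ')?\b/u';
    }

    // The ten pseudo-examples from the question, keyed by their number.
    $examples = [
        1  => 'Xxx xxx WORDsuf1 xxx xxx xxx.',
        2  => 'Xxx xxx WORDsuf2 xxx xxx xxx.',
        3  => 'Xxx xxx pre1WORDsuf1 xxx xxx xxx.',
        4  => 'Xxx xxx WORD xxx xxx xxx.',
        5  => 'Xxx xxx pre1WORD xxx xxx xxx.',
        6  => 'Xxx xxx pre2WORDxxx xxx xxx xxx.',
        7  => 'Xxx xxx xxxWORDxxx xxx xxx xxx.',
        8  => 'Xxx xxx pre1WORDxxxsuf1 xxx xxx xxx.',
        9  => 'Xxx xxx pre1xxxWORDsuf1 xxx xxx xxx.',
        10 => 'Xxx xxx xxxWORDxxx xxx xxx xxx.',
    ];

    $pattern = buildAffixPattern('WORD', ['pre1', 'pre2'], ['suf1', 'suf2']);

    // Keep only the examples the pattern matches; array_filter preserves the keys.
    $matched = array_keys(array_filter(
        $examples,
        fn(string $s): bool => preg_match($pattern, $s) === 1
    ));

    echo implode(', ', $matched), "\n"; // prints "1, 2, 3, 4, 5"
    ```

    For the MySQL side, the same alternation should work with the [[:<:]] and [[:>:]] markers from the question in place of \b, at least on MySQL versions that still accept those markers.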
    


    If you are looking for the whole line as a match, try the regex below and take it from the matched group at index 1:

    (.*(\b(pre1|pre2)?WORD(suf1|suf2)?\b).*)
    


    Use preg_match_all to get all the matched groups.
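
    As a rough sketch of that last step, assuming the examples arrive as one string with one example per line (the sample text below is made up from the question's pseudo-examples), preg_match_all with PREG_SET_ORDER gives one entry per matching line:

    ```php
    <?php
    // Assumed sample data: three of the question's pseudo-examples, one per line.
    $text = "Xxx xxx WORDsuf1 xxx xxx xxx.\n"
          . "Xxx xxx pre2WORDxxx xxx xxx xxx.\n"
          . "Xxx xxx pre1WORD xxx xxx xxx.";

    // Whole-line pattern from above. "." does not match a newline by default,
    // so each match stays within a single line.
    $pattern = '/(.*(\b(pre1|pre2)?WORD(suf1|suf2)?\b).*)/';

    // PREG_SET_ORDER groups the results per match:
    // $m[1] is the whole line, $m[2] is the matched word with its affixes.
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    $lines = array_map(fn(array $m): string => $m[1], $matches);
    $words = array_map(fn(array $m): string => $m[2], $matches);

    print_r($lines); // the first and third line only
    print_r($words); // WORDsuf1, pre1WORD
    ```

    Because the leading .* is greedy, group 2 holds the last occurrence of the word on a line if it appears more than once; group 1 is the whole line either way.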