php mysql regex collation unicode-normalization

Highlighting Search Results: RegEx Character Collation?


When I run a full-text MySQL query, Unicode character collation means that a query for any one of the following returns results matching all of them: saka, sakā, śāka, ṣaka, etc.

Where I'm stuck is with highlighting the matches in search results. With standard RegEx, I can only match and highlight the original query word in the results -- not all the collated matches.

How would one go about solving this? I've initially thought of these approaches:

However, both of these approaches incur substantial processing overhead compared to regular search-result highlighting. The first would incur a mighty CPU overhead; the second would probably eat up less CPU but use at least twice the RAM for the results. Any suggestions?

P.S. In case it's relevant: the specific character set I'm dealing with (IAST for Sanskrit transliteration, with extensions) has three variants of L and N; two variants of M, R and S; and one variant of A, D, E, H, I, T and U. In total that is A-Z plus 19 diacritic variants, plus uppercase (which poses no problem here).


Solution

  • Here's what I ended up doing. Seems to have negligible impact on performance. (I noticed none!)

    First, a function that converts the query word into a regular expression iterating the variants:

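    function iast_normalize_regex($str) {
    
        // Each entry groups a base letter with its diacritic variants.
        $subst = [ 
            'a|ā', 'd|ḍ', 'e|ӗ', 'h|ḥ', 'i|ī', 'l|ḷ|ḹ', 'm|ṁ|ṃ', 
            'n|ñ|ṅ|ṇ', 'r|ṛ|ṝ', 's|ś|ṣ', 't|ṭ', 'u|ū' 
            ];
    
        // Map every variant to the alternation group it belongs to,
        // e.g. 's', 'ś' and 'ṣ' all map to '(s|ś|ṣ)'.
        $subst_rex = [];
    
        foreach($subst as $variants) {
            $chars = explode('|', $variants);
            foreach($chars as $char) {
                $subst_rex[$char] = "({$variants})";
            }
        }
    
        // str_split_unicode() is a user-defined helper (not a PHP built-in)
        // that splits a UTF-8 string into an array of characters.
        $str_chars = str_split_unicode($str);
    
        $str_rex = '';
        foreach($str_chars as $char) {
            // Characters without variants are escaped so that regex
            // metacharacters (and the '#' delimiter) match literally.
            $str_rex .= isset($subst_rex[$char]) ? $subst_rex[$char] : preg_quote($char, '#');
        }
    
        return $str_rex;
    }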
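
    The function above calls str_split_unicode(), which is not a PHP built-in and is not shown here. A minimal sketch of what such a helper might look like, assuming the input is valid UTF-8 and uses precomposed characters (a base letter followed by a separate combining mark would be split into two elements):

    ```php
    <?php
    // Hypothetical stand-in for the str_split_unicode() helper; the original
    // implementation is not shown, so this is only one plausible version.
    function str_split_unicode($str) {
        // With the /u modifier PCRE treats the subject as UTF-8, so the
        // empty pattern splits between code points rather than bytes.
        return preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    }

    print_r(str_split_unicode('śāka')); // ['ś', 'ā', 'k', 'a']
    ```

    On PHP 7.4 or later, the built-in mb_str_split() should serve the same purpose.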
    

    This turns the words saka, śaka, etc. into (s|ś|ṣ)(a|ā)k(a|ā). The resulting pattern is then used to highlight the search results:

    $word = iast_normalize_regex($word);
    $result = preg_replace("#({$word})#iu", "<b>$1</b>", $result);
    

    Presto: I get all the variants highlighted. Thanks for the contributions so far, and please let me know if you can think of better ways to accomplish this. Cheers!
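
    One possible variation, offered only as a sketch: because every variant here is a single code point, bracket character classes such as [sśṣ] can stand in for the alternation groups, and both steps fit in one function. The name iast_highlight() is made up for this example, and the code assumes precomposed UTF-8 input:

    ```php
    <?php
    // Hypothetical helper: builds a character-class pattern from the query
    // word and wraps every match in <b>...</b>.
    function iast_highlight($word, $text) {
        // Same variant groups as above, written without the '|' separators.
        $groups = ['aā', 'dḍ', 'eӗ', 'hḥ', 'iī', 'lḷḹ', 'mṁṃ',
                   'nñṅṇ', 'rṛṝ', 'sśṣ', 'tṭ', 'uū'];

        // Map each variant to the character class of its whole group.
        $class_for = [];
        foreach ($groups as $group) {
            foreach (preg_split('//u', $group, -1, PREG_SPLIT_NO_EMPTY) as $char) {
                $class_for[$char] = "[{$group}]";
            }
        }

        // Build the pattern character by character, escaping anything
        // that has no variant group.
        $pattern = '';
        foreach (preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) as $char) {
            $pattern .= isset($class_for[$char]) ? $class_for[$char] : preg_quote($char, '#');
        }

        // '$0' refers to the whole match, so no capturing group is needed.
        return preg_replace("#{$pattern}#iu", '<b>$0</b>', $text);
    }

    echo iast_highlight('saka', 'śāka and sakā'), "\n";
    // <b>śāka</b> and <b>sakā</b>
    ```

    If the result text is going into an HTML page, it would normally need to be HTML-escaped before the <b> tags are added.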