When I run a fulltext MySQL query, thanks to Unicode character collations I will get results matching all of the following, whichever of them I may query for: saka, sakā, śāka, ṣaka, etc.
Where I'm stuck is with highlighting the matches in search results. With standard RegEx, I can only match and highlight the original query word in the results -- not all the collated matches.
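To make the problem concrete, here is a minimal sketch of literal highlighting. The sample text and variable values are made up for illustration, and the strings are assumed to use precomposed (NFC) characters:

```php
<?php
// Hypothetical sample result text; the words are illustrative only.
$result = 'saka sakā śāka ṣaka';
$word   = 'saka';

// Literal highlighting: only the exact spelling of the query gets wrapped,
// even though the collation matched every variant in the database.
$highlighted = preg_replace(
    '#(' . preg_quote($word, '#') . ')#iu',
    '<b>$1</b>',
    $result
);

echo $highlighted, "\n";
// prints: <b>saka</b> sakā śāka ṣaka
```

The other three spellings come back from MySQL as hits but stay unhighlighted.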
How would one go about solving this? I've initially thought of these approaches:
However, both of these approaches incur substantial processing overhead compared to regular search-result highlighting. The first would incur a mighty CPU overhead; the second would probably eat up less CPU but munch at least twice the RAM for the results. Any suggestions?
P.S. In case it's relevant: The specific character set I'm dealing with (IAST for Sanskrit transliteration with extensions) has three variants of L and N; two variants of M, R and S; and one variant of A, D, E, H, I, T and U; in total A-Z + 19 diacritic variants; + uppercase (that poses no problem here).
Here's what I ended up doing. Seems to have negligible impact on performance. (I noticed none!)
First, a function that converts the query word into a regular expression iterating the variants:
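    function iast_normalize_regex($str) {
        // Each entry lists a base letter followed by its diacritic variants.
        $subst = [
            'a|ā', 'd|ḍ', 'e|ӗ', 'h|ḥ', 'i|ī', 'l|ḷ|ḹ', 'm|ṁ|ṃ',
            'n|ñ|ṅ|ṇ', 'r|ṛ|ṝ', 's|ś|ṣ', 't|ṭ', 'u|ū'
        ];
        // Map every variant to an alternation group covering its whole family.
        $subst_rex = [];
        foreach ($subst as $variants) {
            foreach (explode('|', $variants) as $char) {
                $subst_rex[$char] = "({$variants})";
            }
        }
        // Split into Unicode characters rather than bytes (built-in stand-in
        // for a custom str_split_unicode() helper).
        $str_chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
        $str_rex = '';
        foreach ($str_chars as $char) {
            // Characters without variants are escaped so that regex
            // metacharacters in the query stay literal.
            $str_rex .= isset($subst_rex[$char]) ? $subst_rex[$char] : preg_quote($char, '#');
        }
        return $str_rex;
    }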
Which turns the words saka, śaka etc. into (s|ś|ṣ)(a|ā)k(a|ā). Then, the variant-iterated word pattern is used to highlight the search results:
    $word = iast_normalize_regex($word);
    $result = preg_replace("#({$word})#iu", '<b>$1</b>', $result);
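For comparison, here is a self-contained sketch of the same idea that uses character classes such as [sśṣ] instead of alternation groups, so the only capture group is the outer one. The names iast_variant_pattern and iast_highlight are hypothetical, the variant list is copied from the function above, and the sample text is made up. It assumes lowercase, precomposed (NFC) input; an uppercase query letter would need lowercasing first.

```php
<?php
// Hypothetical helper: every letter that has IAST variants becomes a
// character class covering its whole family.
function iast_variant_pattern(string $word): string
{
    // Variant families, copied from the function above.
    $groups = ['aā', 'dḍ', 'eӗ', 'hḥ', 'iī', 'lḷḹ', 'mṁṃ',
               'nñṅṇ', 'rṛṝ', 'sśṣ', 'tṭ', 'uū'];

    // Map each member of a family to the class for that family.
    $class = [];
    foreach ($groups as $group) {
        foreach (preg_split('//u', $group, -1, PREG_SPLIT_NO_EMPTY) as $char) {
            $class[$char] = '[' . $group . ']';
        }
    }

    // Build the pattern one Unicode character at a time; anything without
    // variants is escaped so it is matched literally.
    $pattern = '';
    foreach (preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $pattern .= $class[$char] ?? preg_quote($char, '#');
    }
    return $pattern;
}

// Hypothetical wrapper: wraps every variant spelling of $word in <b> tags.
function iast_highlight(string $text, string $word): string
{
    $pattern = iast_variant_pattern($word);
    return preg_replace("#({$pattern})#iu", '<b>$1</b>', $text);
}

echo iast_variant_pattern('śaka'), "\n";
// prints: [sśṣ][aā]k[aā]

echo iast_highlight('saka sakā śāka ṣaka', 'śaka'), "\n";
// prints: <b>saka</b> <b>sakā</b> <b>śāka</b> <b>ṣaka</b>
```

Character classes keep the pattern shorter and avoid creating one capture group per letter, which matters only if the replacement refers to groups beyond $1.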
Presto: I get all the variants highlighted. Thanks for the contributions so far, and please let me know if you can think of better ways to accomplish this. Cheers!