php regex utf8mb4

How to wrap 4-byte (UTF-8) emojis with arbitrary strings using PHP?


I'm using this PHP function to wrap emojis in arbitrary HTML tags so that I can style them on web pages. As far as I can tell, CSS3 does not (yet?) offer a selector that targets individual characters such as emojis:

function wrap_emojis($s, $str_before, $str_after) {
    $default_encoding = mb_regex_encoding();
    mb_regex_encoding('UTF-8');
    $s = mb_ereg_replace('([^\x{0000}-\x{FFFF}])', $str_before . '\\1' . $str_after, $s);
    mb_regex_encoding($default_encoding);
    return $s;
}

The issue is that it works for "lower range" emojis such as 😎 (U+1F60E), but it does not work for "higher range" emojis such as ☀️ (U+2600 followed by U+FE0F, i.e. a base character plus a variation selector).

Any ideas on how to fix the PHP function so that it works with these emojis as well?

For example, if I call wrap_emojis("zzz☀️zzz", "A", "B"), the expected result is "zzzA☀️Bzzz" but the actual result is "zzz☀️zzz". It does work with lower range emojis, as noted above: wrap_emojis("zzz😎zzz", "A", "B") returns "zzzA😎Bzzz".
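To see why the two emojis behave differently, it may help to list the code points each one is made of. The following is only a diagnostic sketch, assuming PHP 7.2+ with the mbstring extension; code_points is a hypothetical helper name and is not part of the function above.

```php
<?php
// Hypothetical helper (not from the original post): returns the code points
// of a UTF-8 string as "U+XXXX" labels.
function code_points(string $s): array {
    // Split into individual characters; the /u modifier makes PCRE treat
    // the subject as UTF-8. preg_split returns false on invalid UTF-8.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return [];
    }
    return array_map(function (string $ch): string {
        return sprintf('U+%04X', mb_ord($ch, 'UTF-8'));
    }, $chars);
}

// code_points("\u{1F60E}")         gives ['U+1F60E']           (one code point)
// code_points("\u{2600}\u{FE0F}")  gives ['U+2600', 'U+FE0F']  (two code points)
```

The first emoji is a single code point above U+FFFF, so the negated class [^\x{0000}-\x{FFFF}] matches it. The second is two code points that both sit below U+FFFF, so neither of them is matched.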


Solution

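  • Alright, so it wasn't that hard. I just had to write a regex that matches either a pair of characters from the 16-bit range, i.e. one character from U+0100 to U+FFFF followed by any character up to U+FFFF (this covers a symbol plus its "variation selector"), or, when no such pair is found, any single character above U+FFFF. Pretty sure it will cause issues in other languages, because any two adjacent characters in that range get wrapped (CJK or Cyrillic text, for example), but for English text it works great!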

    $s = mb_ereg_replace('([\x{0100}-\x{FFFF}][\x{0000}-\x{FFFF}]|[^\x{0000}-\x{FFFF}])', $str_before . '\\1' . $str_after, $s);
    

    Hope it enlightens other people on here. Cheers 🤣
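If wrapping arbitrary pairs of non-ASCII characters is a concern, a narrower pattern might be worth trying. The sketch below is a different technique from the fix above, under stated assumptions: it uses preg_replace_callback with the /u modifier instead of mb_ereg_replace, and it treats as an emoji any character above U+FFFF or in the range U+2600 to U+27BF (Miscellaneous Symbols and Dingbats), optionally followed by the variation selector U+FE0F. The function name wrap_emojis_narrow and the choice of ranges are illustrative guesses, not a complete emoji definition.

```php
<?php
// Hypothetical alternative to wrap_emojis(): the ranges below are an
// assumption about what counts as an emoji, not an exhaustive list.
function wrap_emojis_narrow(string $s, string $before, string $after): string {
    // Either a symbol in U+2600..U+27BF or any supplementary-plane character,
    // optionally followed by variation selector-16 (U+FE0F).
    $pattern = '/(?:[\x{2600}-\x{27BF}]|[\x{10000}-\x{10FFFF}])\x{FE0F}?/u';

    // A callback avoids special handling of "$1" or "\1" inside $before/$after.
    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($before, $after): string {
            return $before . $m[0] . $after;
        },
        $s
    );

    // preg_replace_callback returns null on error (e.g. invalid UTF-8);
    // fall back to the unmodified input in that case.
    return $result === null ? $s : $result;
}

// wrap_emojis_narrow("zzz\u{2600}\u{FE0F}zzz", "A", "B") wraps both code points together.
// wrap_emojis_narrow("zzz\u{1F60E}zzz", "A", "B") wraps the single code point.
```

With these ranges, plain CJK or Cyrillic text is left alone. The trade-off is coverage: emojis built from other BMP symbols, and multi-part sequences such as those joined with U+200D or carrying skin-tone modifiers, are not handled as one unit by this sketch.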