I need to detect when a string contains 4-byte characters using PHP. Is there a built-in function or regex that can do this efficiently?
I have found this article that talks about replacing them, but I cannot find a working example that only detects them.
Can php detect 4-byte encoded utf8 chars?
This is about as far as I got, but it fails too:
$chars = str_split($term);
foreach ($chars as $char) {
    if (strlen($char) >= 4) {
        print "Found 4-byte character\n";
    }
}
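Your attempt fails because str_split() splits the string into single bytes, so strlen($char) is always 1. Instead, you can use a regex to match all characters outside the BMP (Basic Multilingual Plane), that is, every code point above U+FFFF. In UTF-8 these are exactly the characters encoded with 4 bytes: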
$str = "€\u{1F600}A\u{10348}¢"; // U+1F600 and U+10348 are example 4-byte characters; €, A and ¢ are in the BMP
$r = preg_match_all('|[\x{10000}-\x{10FFFF}]|u', $str, $matches);
var_dump($matches[0]);
Try it here: https://3v4l.org/JX9aQ
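If you only need a yes/no answer rather than the list of matches, the same pattern can be wrapped in a small helper. This is a minimal sketch: the function name hasFourByteChars is made up for illustration, and it assumes PHP 7.0 or later (for the scalar type declarations and the "\u{...}" escapes in the usage lines).

```php
<?php
// Hypothetical helper (the name is an assumption, not a built-in):
// returns true if $s contains at least one code point above U+FFFF.
function hasFourByteChars(string $s): bool
{
    // preg_match() stops at the first match, so it does not need to
    // collect every occurrence the way preg_match_all() does.
    // With the /u modifier it returns false on invalid UTF-8,
    // which this comparison also reports as "not found".
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $s) === 1;
}

var_dump(hasFourByteChars("abc"));            // bool(false)
var_dump(hasFourByteChars("a\u{1F600}b"));    // bool(true)
var_dump(hasFourByteChars("\u{20AC}\u{A2}")); // bool(false): € is 3 bytes, ¢ is 2 bytes
```

Note that invalid UTF-8 input is reported as false here. If you need to tell "no 4-byte characters" apart from "not valid UTF-8", check the return value of preg_match() for false separately.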
Interesting fact: if you are using PHP 7.4 or later, you can also do this with mb_str_split() and array_filter(). I don't think it will be more efficient than the regex, but it is good to know.
$nonBMP = array_filter(mb_str_split($str), fn($c) => strlen($c)==4);
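A different technique, not part of the approaches above, is to look at the raw bytes. In valid UTF-8 the bytes 0xF0 to 0xF4 occur only as the first byte of a 4-byte sequence, so a byte-level pattern without the /u modifier is enough. The sketch below uses a made-up function name, hasFourByteLead, and assumes the input is already valid UTF-8; on arbitrary binary data it can give false positives.

```php
<?php
// Hypothetical helper (the name is an assumption, not a built-in).
// Assumes $s is valid UTF-8; it does not validate the input.
function hasFourByteLead(string $s): bool
{
    // No /u modifier: the subject is scanned byte by byte, and
    // \xF0-\xF4 are the only possible lead bytes of 4-byte sequences.
    return preg_match('/[\xF0-\xF4]/', $s) === 1;
}

var_dump(hasFourByteLead("abc"));            // bool(false)
var_dump(hasFourByteLead("a\u{10348}b"));    // bool(true)
var_dump(hasFourByteLead("\u{20AC}\u{A2}")); // bool(false)
```

If the input may not be valid UTF-8, the /u version of the regex is the safer choice, because PCRE validates the subject before matching.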