Say we have a UTF-8 string $s and we need to shorten it so it can be stored in N bytes. Blindly truncating it to N bytes could split a multi-byte character and leave invalid UTF-8 at the end. But decoding it to find the character boundaries is a drag. Is there a tidy way?
[Edit 20100414] In addition to S.Mark’s answer, mb_strcut(), I recently found another function that does the job: grapheme_extract($s, $n, GRAPHEME_EXTR_MAXBYTES); from the intl extension. Since intl is an ICU wrapper, I have a lot of confidence in it.
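For reference, here is a minimal sketch of how the grapheme_extract() call might be wrapped. The helper name truncate_to_bytes_by_grapheme is made up for illustration and is not part of intl. The sketch assumes the intl extension is loaded and that $s is valid UTF-8.

```php
<?php
// Hypothetical helper (not part of intl): keep as many whole grapheme
// clusters from the start of $s as fit in $maxBytes bytes.
// Assumes the intl extension is loaded and $s is valid UTF-8.
function truncate_to_bytes_by_grapheme(string $s, int $maxBytes): string
{
    if ($maxBytes <= 0) {
        return '';
    }
    // GRAPHEME_EXTR_MAXBYTES makes the second argument a byte limit.
    $cut = grapheme_extract($s, $maxBytes, GRAPHEME_EXTR_MAXBYTES);
    // grapheme_extract() returns false on failure; treat that as "nothing fits".
    return $cut === false ? '' : $cut;
}
```

Because the unit is a grapheme cluster rather than a code point, a base letter and a combining mark after it stay together. For example, with "e\xCC\x81x" (an e, a combining acute accent, then x) and a 3-byte limit, the result should be the e plus its accent.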
I don't think you need to reinvent the wheel. You can just use mb_strcut, making sure the encoding is set to UTF-8 first.
mb_internal_encoding('UTF-8');
echo mb_strcut("\xc2\x80\xc2\x80", 0, 3); //from index 0, cut to 3 bytes.
It returns
\xc2\x80
because in \xc2\x80\xc2 the last byte, \xc2, is a lead byte with no continuation byte after it, so it is invalid on its own and gets dropped.
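To see why the trailing \xc2 is dropped, here is a rough sketch of the same boundary rule in plain PHP, with no extensions. It is not how mb_strcut is implemented, only an illustration. The function name utf8_truncate_bytes is made up, and the sketch assumes the input is already well-formed UTF-8.

```php
<?php
// Hypothetical helper: truncate $s to at most $maxBytes bytes without
// splitting a multi-byte character. Assumes $s is well-formed UTF-8.
function utf8_truncate_bytes(string $s, int $maxBytes): string
{
    if ($maxBytes <= 0) {
        return '';
    }
    if (strlen($s) <= $maxBytes) {
        return $s; // already fits
    }
    $end = $maxBytes;
    // $s[$end] is the first byte that would be cut off. If it is a
    // continuation byte (bit pattern 10xxxxxx), the cut falls inside a
    // character, so move the cut back until it sits on a lead byte.
    while ($end > 0 && (ord($s[$end]) & 0xC0) === 0x80) {
        $end--;
    }
    return substr($s, 0, $end);
}

echo bin2hex(utf8_truncate_bytes("\xc2\x80\xc2\x80", 3)), "\n"; // prints c280
```

In practice mb_strcut is still the easier choice, since it is already there and handles other encodings too.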