
Reliably rotating any string


I was experimenting with multibyte strings and how to handle them. Using the code that you can see here

https://gist.github.com/charlydagos/89f67808e01f97e6de91

I was successful in rotating most strings. However, I noticed that the line

$chr = mb_substr($str, $i, 1);

will not work for flag emojis, since they are made up of more than one Unicode code point.
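
To make the problem concrete, here is a minimal sketch of what mb_substr sees when it is handed a flag. It assumes the mbstring extension is loaded; the variable names are only illustrative and are not taken from the gist.

```php
<?php
// The Swiss flag is two code points:
// REGIONAL INDICATOR SYMBOL LETTER C (U+1F1E8) and LETTER H (U+1F1ED).
$flag = "\u{1F1E8}\u{1F1ED}";

// mb_strlen counts code points, not the glyphs a reader sees.
$codePoints = mb_strlen($flag, 'UTF-8');

// Asking mb_substr for one "character" returns only the first half of the flag.
$firstHalf = mb_substr($flag, 0, 1, 'UTF-8');

printf("code points: %d, bytes in first half: %d\n", $codePoints, strlen($firstHalf));
```

So a loop that steps through the string one mb_substr character at a time splits the flag into its two regional indicator symbols.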

You can try the following in your own shells:

This gives the desired output: $ php string_rotate_mb.php "你好"

This, however, $ php string_rotate_mb.php "🇨🇭" returns [H][C]

Which is technically correct: it did rotate the string. But really it's a single glyph, and my desired output is the flag alone (or a sequence of flags, which currently turns into even more garbled glyphs, sometimes even into different flags).

How can I, then, reliably determine that I should grab a $length = 1 or a $length = 2 (or a $length = N) substring using mb_substr?

For reference, I'm using PHP 7.0.2 (cli) (built: Jan 7 2016 10:40:26) ( NTS ), ZSH_VERSION = 5.2, LC_ALL=en_us.utf-8, and iTerm2: Build 2.9.git.8dff8db518.

Update - Feb 5th 2016

Solution: https://gist.github.com/charlydagos/6755ad994da07a7b4959#file-string_rotate_working-php-L39-L56

Thank you, roeland, for introducing the concept of grapheme clusters. Good info also in the following links


Solution

  • There are a lot more examples where this fails:

    And so on…

    I think what you're looking for is called grapheme clusters. Without library support I think this is pretty difficult to get right.

    For recent PHP versions there is the intl extension. You may loop over the clusters using its grapheme functions, such as grapheme_strlen and grapheme_substr.
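
A minimal sketch of that loop, assuming the intl extension is installed and that "rotating" means reversing the order of the visible characters (the question does not spell this out). The function names split_graphemes and reverse_graphemes are hypothetical helpers, not part of intl or of the linked gist.

```php
<?php
// Hypothetical helper: split a UTF-8 string into grapheme clusters
// using the intl extension's grapheme_* functions.
function split_graphemes(string $str): array
{
    $clusters = [];
    // grapheme_strlen counts grapheme clusters rather than code points.
    $length = grapheme_strlen($str);
    for ($i = 0; $i < $length; $i++) {
        // Offset and length are measured in grapheme clusters here.
        $clusters[] = grapheme_substr($str, $i, 1);
    }
    return $clusters;
}

// Hypothetical helper: reverse a string cluster by cluster,
// so that multi-code-point glyphs stay intact.
function reverse_graphemes(string $str): string
{
    return implode('', array_reverse(split_graphemes($str)));
}

echo reverse_graphemes("你好"), PHP_EOL;   // 好你
echo reverse_graphemes("🇨🇭"), PHP_EOL;   // the flag stays in one piece
```

How a run of several adjacent flags is segmented depends on the ICU version that intl was built against, so it is worth testing that case on the target system. Calling grapheme_substr once per cluster is also quadratic in the string length, which should be fine for short command-line arguments but may matter for long inputs.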