Tags: php, split, utf-8, diacritics

How can I process a UTF-8 sentence with umlauts letter by letter?


I have a German sentence with "Umlaute" (ä, ö, ü, Ä, Ö, Ü, ß) and want to process it letter by letter. Let's say I want to write each word backwards.

What I have: "Die Straße enthält viele größere Schlaglöcher"
What I want: "eiD eßartS tlähtne eleiv ereßörg rehcölgalhcS"

I tried to explode the sentence to an array consisting of single words:

$MyText = "Die Straße enthält viele größere Schlaglöcher";
$Words = preg_split(@"/[^\wäöüÄÖÜß]/", $MyText);

But as soon as I try to iterate over the $Words array, I run into a problem: it contains letters ("ä", "ö", "ü", ...) that are represented by 2 bytes each in UTF-8, so writing the words backwards byte by byte does not work!
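To illustrate what goes wrong, here is a minimal sketch. It assumes the reversal is attempted with a byte-oriented function such as strrev(); the question does not say which function was actually used, so treat that as an assumption.

```php
<?php
// Assumption: the word is reversed with strrev(), which operates on bytes.
$word = "enthält";

// strlen() counts bytes: "ä" takes 2 bytes in UTF-8, so 7 letters occupy 8 bytes.
$byteCount = strlen($word);

// strrev() swaps the two bytes of "ä", so the result is no longer valid UTF-8.
$reversed = strrev($word);
$isValidUtf8 = mb_check_encoding($reversed, 'UTF-8');

var_dump($byteCount);   // int(8)
var_dump($isValidUtf8); // bool(false)
```

The continuation byte of "ä" ends up in front of its lead byte, which is why the output shows broken characters instead of the umlaut.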


Solution

  • One solution is the following:

    1. Convert the UTF-8 string to UTF-32, a fixed-width encoding in which every character (code point) occupies exactly 4 bytes
    2. Split that string into 4-byte chunks => $LetterArray
    3. Do something with $LetterArray
    4. Combine the changed array into a new string
    5. Convert the new string back to UTF-8

    Here is a code snippet showing how this could be done:

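    $Word = "🙄enthält😀";
    // 1. Convert to UTF-32 so that every character occupies exactly 4 bytes
    $Word_Unicode = mb_convert_encoding($Word, 'UTF-32', 'UTF-8');
    // 2. Split into 4-byte chunks, one per character
    $Letters = str_split($Word_Unicode, 4);
    // 3. Process the characters (here: reverse their order)
    $Letters = array_reverse($Letters);
    // 4. Join the chunks back into one UTF-32 string
    $NewWord_Unicode = implode("", $Letters);
    // 5. Convert the result back to UTF-8
    $NewWord = mb_convert_encoding($NewWord_Unicode, 'UTF-8', 'UTF-32');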
    

    Result: 🙄enthält😀 => 😀tlähtne🙄

    Try it here: https://onlinephp.io/c/69a20
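
The snippet above reverses a single word. To get the output asked for in the question, the same technique could be applied to every word of the sentence. The following is only a sketch under a few assumptions: reverseWordUtf32() and reverseEachWord() are hypothetical helper names, a "word" is taken to be any run of Unicode letters (\p{L}+), and 'UTF-32BE' is used instead of 'UTF-32' merely to make the byte order explicit.

```php
<?php
// Hypothetical helper: reverse one word character by character via UTF-32.
function reverseWordUtf32(string $word): string
{
    // Every code point occupies exactly 4 bytes in UTF-32BE.
    $utf32 = mb_convert_encoding($word, 'UTF-32BE', 'UTF-8');
    $letters = array_reverse(str_split($utf32, 4));
    return mb_convert_encoding(implode('', $letters), 'UTF-8', 'UTF-32BE');
}

// Hypothetical helper: reverse each word but keep spaces and punctuation in place.
function reverseEachWord(string $sentence): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8;
    // \p{L}+ matches a run of letters, including ä, ö, ü and ß.
    $result = preg_replace_callback(
        '/\p{L}+/u',
        function (array $match): string {
            return reverseWordUtf32($match[0]);
        },
        $sentence
    );

    // preg_replace_callback() returns null on error (e.g. invalid UTF-8 input);
    // in that case this sketch returns the sentence unchanged.
    return $result ?? $sentence;
}

echo reverseEachWord("Die Straße enthält viele größere Schlaglöcher"), PHP_EOL;
// prints: eiD eßartS tlähtne eleiv ereßörg rehcölgalhcS
```

Because the callback only touches the matched letter runs, whitespace and punctuation stay where they are, so no separate split and join step is needed.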