Tags: php, string, parsing, utf-16, hebrew

I'm trying to parse a Hebrew word in PHP. It looks OK as a whole string, but when I split it into characters, the individual characters don't display correctly.


Here's my simplified test code:

<!DOCTYPE html>
    <?php
        //uncommenting the next line results in the whole page displaying in "chinese -simplified"
        //header("content-type: text/html; charset=UTF-16");
        header('Content-language: he');
    ?>
<html>
<head>
    <meta http-equiv=Content-Type content="text/html; charset=UTF-16">
    <meta http-equiv="content-language" content="he-il">
</head>
<body>
<?php
        // in Production, we are grabbing the hebrew word from the database
        //$sql = "SELECT masoretic FROM codex WHERE id = 20"; // just grabs a word from the database
                                                            // it is stored using UTF16_general_ci on mySQL
        // in this test we can mock the exact same data that was copy and pasted in
        // the results were the same with the data from the db
            $masoretic = "בָּרָ֣א";

            echo $masoretic . '<br>'; // displays correctly in HEBREW = בָּרָ֣א
            // now loop through the word and process each letter
            $length = strlen($masoretic);
            // even though there are only 3 real letters, the diacritic marks count as characters, so we should get at least 7 loops
            for ($x = 0; $x <= $length; $x++) {
                $letter = substr($masoretic,0,1); // process this letter
                $masoretic = substr($masoretic, 1); // the rest of the word
                $name = '';
                $recognized = false;
                switch($letter){
                    case 'ר':
                        $recognized = true;
                        $name = 'Raysh';
                        break;
                    case 'א':
                        $recognized = true;
                        $name = 'Aleph';
                        break;
                    default:
                        $recognized = false;
                        break;
                }
                if($recognized){
                    echo ('found a ' . $name);
                    echo $letter; // for now just display it
                }else{
                        echo 'unrecognized letter:';
                        print_r($letter);
                        echo '<br>';
                }                       
            }           
    ?>
</body>

The page displays like this:

בָּרָ֣א
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:

I find it really strange that the full Hebrew word shows up OK but each individual letter won't display. I assumed something funky was going on with UTF-16, so I added headers, but in some cases that actually made it worse (see the inline comments).


Solution

  • In UTF-16 each glyph (code point) is represented by 2 or 4 bytes, so byte-oriented functions such as strlen() and substr() cut them apart. Use the multibyte-aware string functions instead, e.g. mb_str_split() (available since PHP 7.4).

    // input is in UTF-8 and converted to UTF-16, since everything on SO is UTF-8
    $in_8  = 'בָּרָ֣א';
    $in_16 = mb_convert_encoding($in_8, 'UTF-16', 'UTF-8');
    
    foreach(mb_str_split($in_16, 1, 'UTF-16') as $glyph_16) {
        // convert back to UTF-8 for display in this example
        $glyph_8 = mb_convert_encoding($glyph_16, 'UTF-8', 'UTF-16');
        printf("%s %s\n",bin2hex($glyph_16), $glyph_8);
    }
    

    You should be able to omit the conversions in your own code; they are only there for the benefit of people like me who don't work in UTF-16.

    Output:

    05d1 ב
    05b8 ָ
    05bc ּ
    05e8 ר
    05b8 ָ
    05a3 ֣
    05d0 א
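
As a rough sketch of how the same idea could be applied to the loop in the question, the version below splits the word with mb_str_split() and looks each piece up by name. It assumes the PHP file itself is saved as UTF-8, so the string literals are UTF-8 bytes; the symptoms in the question hint at that, but it is not confirmed. If your data really is UTF-16, pass 'UTF-16' as the encoding and convert the lookup keys to match. The $names table and $found list are hypothetical and only cover the two letters from the question.

```php
<?php
// Assumption: this file is saved as UTF-8, so every literal below is UTF-8.
$masoretic = 'בָּרָ֣א';

// strlen() counts bytes, mb_strlen() counts code points.
printf(
    "bytes: %d, code points: %d\n",
    strlen($masoretic),
    mb_strlen($masoretic, 'UTF-8')
);

// Hypothetical lookup table, replacing the switch in the question.
$names = [
    'ר' => 'Raysh',
    'א' => 'Aleph',
];

$found = [];
// mb_str_split() (PHP 7.4+) yields one code point per array element.
foreach (mb_str_split($masoretic, 1, 'UTF-8') as $letter) {
    if (isset($names[$letter])) {
        $found[] = $names[$letter];
        echo 'found a ', $names[$letter], ' ', $letter, "\n";
    } else {
        // Diacritics and unlisted letters end up here; print the code point.
        printf("unrecognized letter: U+%04X\n", mb_ord($letter, 'UTF-8'));
    }
}
```

Iterating over the array returned by mb_str_split() also avoids the extra pass that `$x <= $length` causes in the original loop, which would account for the empty "unrecognized letter:" line at the end of the output.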