Here's my simplified test code:
<!DOCTYPE html>
<?php
//uncommenting the next line results in the whole page displaying in "chinese -simplified"
//header("content-type: text/html; charset=UTF-16");
header('Content-language: he');
?>
<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=UTF-16">
<meta http-equiv="content-language" content="he-il">
</head>
<body>
<?php
// in Production, we are grabbing the hebrew word from the database
//$sql = "SELECT masoretic FROM codex WHERE id = 20"; // just grabs a word from the database
// it is stored using UTF16_general_ci on mySQL
// in this test we can mock the exact same data that was copy and pasted in
// the results were the same with the data from the db
$masoretic = "בָּרָ֣א";
echo $masoretic . '<br>'; // displays correctly in HEBREW = בָּרָ֣א
// now loop through the word and process each letter
$length = strlen($masoretic);
// even though there are only 3 real letters, the diacritic marks count as characters, so we should get at least 7 loops
for ($x = 0; $x <= $length; $x++) {
$letter = substr($masoretic,0,1); // process this letter
$masoretic = substr($masoretic, 1); // the rest of the word
$name = '';
$recognized = false;
switch($letter){
case 'ר':
$recognized = true;
$name = 'Raysh';
break;
case 'א':
$recognized = true;
$name = 'Aleph';
break;
default:
$recognized = false;
break;
}
if($recognized){
echo ('found a ' . $name);
echo $letter; // for now just display it
}else{
echo 'unrecognized letter:';
print_r($letter);
echo '<br>';
}
}
?>
</body>
The page displays like this:
בָּרָ֣א
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:
I find it really strange that the full Hebrew word shows up fine but each individual letter won't display. I assume there's something funky going on with the UTF-16, so I added headers, but in some cases that actually made it worse (see the inline comments).
In UTF-16 each code point is represented by 2 or 4 bytes, while strlen() and substr() work on single bytes, so you need to use the multibyte-aware string functions, e.g. mb_str_split().
// Input is in UTF-8 and converted to UTF-16, since everything on SO is UTF-8
$in_8 = 'בָּרָ֣א';
$in_16 = mb_convert_encoding($in_8, 'UTF-16', 'UTF-8');
// Split into one-character pieces, interpreting the bytes as UTF-16
foreach(mb_str_split($in_16, 1, 'UTF-16') as $glyph_16) {
// convert back to UTF-8 for display in this example
$glyph_8 = mb_convert_encoding($glyph_16, 'UTF-8', 'UTF-16');
printf("%s %s\n",bin2hex($glyph_16), $glyph_8);
}
You should be able to omit the conversions in your own code. They are only there for the benefit of people like me who don't work in UTF-16.
Output:
05d1 ב
05b8 ָ
05bc ּ
05e8 ר
05b8 ָ
05a3 ֣
05d0 א
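Applied to the loop from the question, the same idea might look like the sketch below. It assumes the string is UTF-8 (which a literal in a UTF-8 source file would be) and takes the encoding as a parameter in case your data really is UTF-16. The nameLetters() function and the $names lookup table are hypothetical names introduced for illustration, and only the two letters from the original switch are filled in. mb_str_split() needs PHP 7.4 or later.

```php
<?php
// Sketch: walk the word one code point at a time instead of one byte at a time.
// nameLetters() and $names are illustrative, not part of the original code.
function nameLetters(string $word, string $encoding = 'UTF-8'): array
{
    // Lookup table standing in for the switch; keys are UTF-8 literals.
    // Extend it with the remaining letters and marks you care about.
    $names = [
        'ר' => 'Raysh',
        'א' => 'Aleph',
    ];

    $result = [];
    foreach (mb_str_split($word, 1, $encoding) as $char) {
        // Normalise to UTF-8 so the lookup works whatever the input encoding is.
        $utf8 = mb_convert_encoding($char, 'UTF-8', $encoding);
        $result[] = [$utf8, $names[$utf8] ?? null];
    }
    return $result;
}

foreach (nameLetters('בָּרָ֣א') as [$char, $name]) {
    if ($name !== null) {
        echo 'found a ' . $name . ' ' . $char . "\n";
    } else {
        // Print the bytes in hex so lone diacritic marks are still identifiable.
        echo 'unrecognized letter: ' . bin2hex($char) . "\n";
    }
}
```

The diacritic marks still arrive as separate code points, as in the output above, so they fall through to the unrecognized branch until you add them to the table.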