I'm updating a PHP app that imports CSV files encoded in UTF-16 (from Google Keyword Planner) and converts the values to UTF-8.
Up to PHP 8.0 this works as expected, but from PHP 8.1 onwards a ? is appended to the values after the conversion from UTF-16 to UTF-8:
var_dump(mb_convert_encoding("\0008\0008\0000\000", "UTF-8", "UTF-16"));
// Output with PHP 8.1.3 - 8.1.13, 8.2.0:
// string(4) "880?"
// Output with PHP 7.4.32, 8.0.8 - 8.0.26:
// string(3) "880"
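Your source string is equal to "\x00\x38\x00\x38\x00\x30\x00", which is 7 bytes long and therefore an invalid length for UTF-16, which always needs 2 or 4 bytes per character. The trailing lone \x00 is an incomplete code unit; judging by the output above, PHP 8.1+ reports it as ? where older versions silently dropped it.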
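You can check the byte count yourself. The following is a minimal sketch using only strlen() and bin2hex(); the variable name $input is chosen just for illustration:

```php
<?php
// The literal from the question: "\000" is an octal escape for a NUL byte,
// while the "8" and "0" that follow each escape are plain characters.
$input = "\0008\0008\0000\000";

var_dump(strlen($input));           // int(7)
var_dump(bin2hex($input));          // string(14) "00380038003000"
var_dump(strlen($input) % 2 === 0); // bool(false): not a whole number of 16-bit code units
```

The hex dump shows three complete code units (0038, 0038, 0030) followed by a single dangling 00 byte.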
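Solution: provide proper input. Perhaps you also misread the octal notation: "\0008" is the escape \000 followed by the literal character 8, so the final \000 in your string is a stray extra byte. This is much easier to see when escape notation and literal characters are not mixed: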
approach | only 6 bytes (value '880') | make it 8 bytes (value '8800') |
---|---|---|
full hexadecimal notation | "\x00\x38\x00\x38\x00\x30" | "\x00\x38\x00\x38\x00\x30\x00\x30" |
mixed hexadecimal notation | "\x008\x008\x000" | "\x008\x008\x000\x000" |
full octal notation | "\000\070\000\070\000\060" | "\000\070\000\070\000\060\000\060" |
mixed octal notation | "\0008\0008\0000" | "\0008\0008\0000\0000" |
concatenated string to make it more clear | "\x00". '8'. "\x00". '8'. "\x00". '0' | "\x00". '8'. "\x00". '8'. "\x00". '0'. "\x00". '0' |