mb_convert_encoding() with UTF-16 input in PHP >= 8.1


I'm updating a PHP app that imports CSV files encoded in UTF-16 (from Google Keyword Planner) and converts the values to UTF-8.

Up to PHP 8.0 this works as expected, but from PHP 8.1 onward a ? is appended to the values after the conversion from UTF-16 to UTF-8:

var_dump(mb_convert_encoding("\0008\0008\0000\000", "UTF-8", "UTF-16"));

// Output with PHP 8.1.3 - 8.1.13, 8.2.0:
// string(4) "880?"

// Output with PHP 7.4.32, 8.0.8 - 8.0.26:
// string(3) "880"

Solution

  • Your source string is equal to "\x00\x38\x00\x38\x00\x30\x00", which is 7 bytes. That is an invalid length for UTF-16, which always needs 2 or 4 bytes per character. The trailing lone byte is an incomplete code unit, which PHP 8.1+ apparently reports as "?" instead of silently dropping it.

    Solution: provide well-formed input. The stray byte may come from a misreading of the octal notation; the string is much easier to read without mixing escape sequences and literal characters:

    approach                   | only 6 bytes (value '880')                 | make it 8 bytes (value '8800')
    full hexadecimal notation  | "\x00\x38\x00\x38\x00\x30"                 | "\x00\x38\x00\x38\x00\x30\x00\x30"
    mixed hexadecimal notation | "\x008\x008\x000"                          | "\x008\x008\x000\x000"
    full octal notation        | "\000\070\000\070\000\060"                 | "\000\070\000\070\000\060\000\060"
    mixed octal notation       | "\0008\0008\0000"                          | "\0008\0008\0000\0000"
    concatenated, for clarity  | "\x00" . '8' . "\x00" . '8' . "\x00" . '0' | "\x00" . '8' . "\x00" . '8' . "\x00" . '0' . "\x00" . '0'
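
    To check the byte counts yourself, a minimal sketch along the following lines may help. It inspects the string from the question with strlen() and bin2hex(), then converts the well-formed 6-byte variant. The helper utf16ToUtf8() is a hypothetical name made up for this illustration and is not part of mbstring; it simply refuses odd-length input, on the assumption that you would rather fail loudly than get a substitute character in the result.

```php
<?php
// The string from the question: octal escape \000 (NUL) mixed with literal digits.
$fromQuestion = "\0008\0008\0000\000";

// Inspect the raw bytes: 7 bytes, so the last UTF-16 code unit is incomplete.
$length = strlen($fromQuestion);   // 7
$hex    = bin2hex($fromQuestion);  // "00380038003000"

// Without the dangling byte, three complete 2-byte code units remain.
$wellFormed = "\0008\0008\0000";
$converted  = mb_convert_encoding($wellFormed, 'UTF-8', 'UTF-16'); // "880"

// Hypothetical helper (not part of mbstring): reject odd-length input
// instead of letting a substitute character slip into the result.
function utf16ToUtf8(string $bytes): string
{
    if (strlen($bytes) % 2 !== 0) {
        throw new InvalidArgumentException('UTF-16 input must have an even byte length');
    }
    return mb_convert_encoding($bytes, 'UTF-8', 'UTF-16');
}
```

    An even length is only a necessary condition, not a full validity check (it does not catch unpaired surrogates, for example), but it is enough to catch the stray byte discussed here.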