Tags: php, utf-8, character-encoding, mb-convert-encoding

Unexpected result from mb_detect_encoding with Windows-1252


I've read Wikipedia's article on Windows-1252 character encoding. For characters whose byte value is < 128, it should be the same as ASCII/UTF-8.

This makes sense:

    php -r "var_export(mb_detect_encoding(\"\x92\", 'windows-1252', true));"
    'Windows-1252'

A right curly apostrophe (byte 0x92 in Windows-1252) is detected properly.

    php -r "var_export(mb_detect_encoding(\"a\", 'windows-1252', true));"
    false

Huh? The letter "a" isn't Windows-1252?

My terminal, where I'm running this, is set to UTF-8, so the letter 'a' should be the same byte sequence as in ASCII. To minimize the variables, I can specify the Windows-1252 byte explicitly:

    php -r "var_export(mb_detect_encoding(\"\x61\", 'windows-1252', true));"
    false

Changing the "strict" parameter (which has pretty useless documentation) does nothing in these cases.


Solution

  • Encoding detection is not supported for windows-1252. According to the mb_detect_order documentation:

    mbstring currently implements the following encoding detection filters. If there is an invalid byte sequence for the following encodings, encoding detection will fail.

    UTF-8, UTF-7, ASCII, EUC-JP, SJIS, eucJP-win, SJIS-win, JIS, ISO-2022-JP

    For ISO-8859-*, mbstring always detects as ISO-8859-*.

    For UTF-16, UTF-32, UCS2 and UCS4, encoding detection will fail always.
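
Since Windows-1252 is not in that list, one possible workaround is to validate the bytes instead of detecting them. The sketch below is only an illustration: guess_encoding() is a hypothetical helper (not part of mbstring), and it assumes the input can only be ASCII, UTF-8, or Windows-1252. Anything that is not valid UTF-8 is labeled Windows-1252 by assumption, not by actual detection.

```php
<?php
// Sketch only: guess_encoding() is a hypothetical helper, not an mbstring function.
// Assumption: the input is ASCII, UTF-8, or Windows-1252; nothing else is considered.
function guess_encoding(string $bytes): string
{
    // Pure 7-bit input is byte-for-byte identical in ASCII, UTF-8 and Windows-1252.
    if (mb_check_encoding($bytes, 'ASCII')) {
        return 'ASCII';
    }
    // Multi-byte UTF-8 has a strict structure, so a valid sequence is a strong signal.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }
    // Fallback by assumption: every other byte string is treated as Windows-1252.
    return 'Windows-1252';
}

var_export(guess_encoding("a"));            // 'ASCII'
echo PHP_EOL;
var_export(guess_encoding("\x92"));         // 'Windows-1252'
echo PHP_EOL;
var_export(guess_encoding("\xE2\x80\x99")); // 'UTF-8'
echo PHP_EOL;

// Converting the Windows-1252 byte 0x92 yields U+2019 encoded as UTF-8.
echo bin2hex(mb_convert_encoding("\x92", 'UTF-8', 'Windows-1252')), PHP_EOL; // e28099
```

The order of the checks matters: ASCII is tested first because a string like "a" is valid in all three encodings, so no function can tell them apart from the bytes alone.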