Tags: php, utf-8, iso-8859-1, cp1250

Encoding conversion in PHP (ISO-8859-1, UTF-8, CP1250)


I want to work with data from a CSV file, but I noticed that some letters are not displayed correctly. I have tried many ways to convert the encoding, but nothing works. I am working on macOS with PHP 7.4.4.

After calling fgets() or fgetcsv() on the file handle, I get this (two lines shown as an example):

Kód ADM;Kód obce;Název obce;Kód MOMC;Název MOMC;Kód MOP;Název MOP;Kód èásti obce;Název èásti obce;Kód ulice;Název ulice;Typ SO;Èíslo domovní;Èíslo orientaèní;Znak èísla orientaèního;PSÈ;Souøadnice Y;Souøadnice X;Platí Od

1234;1234;HorniDolni;;;;;1234;HorniDolni;;;è.p.;2;;;748790401;4799.98;15893971.21;2013-12-01T00:00:00

It is more or less correct Czech, but the letter č is replaced by è and ř is replaced by ø, neither of which is part of the Czech alphabet. I am fairly sure there are more misplaced letters in the file.

Running file -I path/to/file returns file: text/plain; charset=iso-8859-1, which is unfortunate because, according to the wiki, this charset does not cover the Czech alphabet.

None of the following calls fixed the misplaced letters:

mb_convert_encoding($line, 'UTF-8', 'ISO8859-1')
iconv('ISO-8859-1', 'UTF-8', $line)
iconv('ISO8859-1', 'UTF-8', $line)

I have noticed that in ISO-8859-1 the letter ø has the code 00F8. Windows-1250 (which includes the Czech alphabet) has the correct letter ř, with Unicode code 0159, at that same byte position F8. The same goes for č and è, which share byte position E8. I do not understand encodings very deeply, but it seems that the file is encoded in Windows-1250 while the interpreter thinks the encoding is ISO-8859-1 and shows the letter that sits at the original letter's position in that charset.

But no conversion (ISO-8859-1 => Windows-1250, ISO-8859-1 => UTF-8, or the other way around) works.

Does anyone have any idea how to solve this? Thanks!


Solution

  • The problem with 8-bit character encodings is that it mostly takes human intelligence to identify the correct code page.

    When you run file on a file, it can work out that the file is mostly made up of printable characters, but as it only looks at the bytes, it can't easily tell the difference between iso-8859-1 and iso-8859-2. To file, a byte such as 0x80 is just 0x80, whatever character it was meant to be.

    file can only tell that the file is text and likely iso-8859-* or windows-*, because it uses bytes in the range 0x80-0xFF, i.e. it is not plain ASCII.
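
    To make this concrete, here is a minimal sketch that decodes the same single byte under three assumed source encodings. It assumes the iconv extension is available and that your iconv build knows the names ISO-8859-2 and CP1250, which is usually the case.

    ```php
    <?php
    // The single byte 0xE8 is all that is actually stored in the file.
    $byte = "\xE8";

    // Decode that byte under three different assumed source encodings.
    $asLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte); // è
    $asLatin2 = iconv('ISO-8859-2', 'UTF-8', $byte); // č
    $asCp1250 = iconv('CP1250', 'UTF-8', $byte);     // č

    echo bin2hex($byte), " => $asLatin1 / $asLatin2 / $asCp1250", PHP_EOL;
    ```

    The byte never changes; only the table used to interpret it does. That is why converting from ISO-8859-1 keeps producing è instead of č.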

    (Unicode encodings such as UTF-8 and UTF-16 are easier to detect from their byte sequences or from the byte order mark at the start of the file.)
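
    As a rough illustration of why UTF-8 is easier to recognise, the sketch below checks for a UTF-8 byte order mark and for well-formed UTF-8. The helper names hasUtf8Bom and isValidUtf8 are made up for this example; the validity check relies on PCRE rejecting malformed input when the /u modifier is set, which is one of several ways to do it.

    ```php
    <?php
    // Returns true when $bytes starts with the UTF-8 byte order mark (EF BB BF).
    function hasUtf8Bom(string $bytes): bool
    {
        return strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
    }

    // Returns true when $bytes is a well-formed UTF-8 sequence.
    // An empty pattern with the /u modifier makes PCRE validate the subject.
    function isValidUtf8(string $bytes): bool
    {
        return preg_match('//u', $bytes) === 1;
    }

    var_dump(isValidUtf8("\xC4\x8D"));   // "č" encoded as UTF-8: bool(true)
    var_dump(isValidUtf8("\xE8\xE1st")); // "část" as ISO-8859-2 bytes: bool(false)
    ```

    Note that plain ASCII also passes the validity check, so a true result only tells you the bytes could be UTF-8. No comparable check can tell two 8-bit code pages apart.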

    There are some intelligent character codepage detectors that, with the help of dictionaries from different languages, can estimate the codepage based on character/byte sequences.
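
    Real detectors are far more sophisticated, but the idea can be sketched with a crude stand-in: decode the bytes under each candidate encoding and keep the one that yields the most Czech accented letters. The function guessCzechEncoding and its scoring rule are invented for this illustration, not taken from any library, and the script itself has to be saved as UTF-8 for the character class to work.

    ```php
    <?php
    // Hypothetical helper: guess which candidate encoding makes $bytes look
    // most like Czech text. Returns null if no candidate can decode the input.
    function guessCzechEncoding(string $bytes, array $candidates): ?string
    {
        // Accented letters of the Czech alphabet, both cases.
        $czech = '/[áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ]/u';
        $best = null;
        $bestScore = -1;

        foreach ($candidates as $encoding) {
            // iconv() returns false (with a notice) for bytes undefined in $encoding.
            $utf8 = @iconv($encoding, 'UTF-8', $bytes);
            if ($utf8 === false) {
                continue;
            }
            // Score = number of Czech accented letters after decoding.
            $score = (int) preg_match_all($czech, $utf8);
            if ($score > $bestScore) {
                $best = $encoding;
                $bestScore = $score;
            }
        }

        return $best;
    }

    // "Příliš žluťoučký kůň" stored as ISO-8859-2 bytes.
    $sample = "P\xF8\xEDli\xB9 \xBElu\xBBou\xE8k\xFD k\xF9\xF2";
    echo guessCzechEncoding($sample, ['ISO-8859-1', 'ISO-8859-2', 'CP1250']), PHP_EOL;
    ```

    On ties the first candidate wins, so text that only uses bytes shared by ISO-8859-2 and CP1250 (such as č and ř) cannot be told apart this way. For those bytes both conversions give the same result anyway.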

    The likely conversion you need is simply iso-8859-2 -> UTF-8.
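
    A minimal sketch of that conversion applied to a semicolon-separated file like the one in the question follows. The helper name readCsvAsUtf8 is made up, and ISO-8859-2 as the source encoding is an assumption (try CP1250 if some letters still look wrong).

    ```php
    <?php
    // Hypothetical helper: read a semicolon-separated file whose bytes are in
    // $sourceEncoding and return its rows as arrays of UTF-8 strings.
    function readCsvAsUtf8(string $path, string $sourceEncoding = 'ISO-8859-2'): array
    {
        $handle = fopen($path, 'rb');
        if ($handle === false) {
            throw new RuntimeException("Cannot open $path");
        }

        $rows = [];
        while (($line = fgets($handle)) !== false) {
            // Convert the raw bytes first, then split the line into fields.
            $utf8 = iconv($sourceEncoding, 'UTF-8', rtrim($line, "\r\n"));
            if ($utf8 === false) {
                fclose($handle);
                throw new RuntimeException("Cannot convert a line from $sourceEncoding");
            }
            $rows[] = str_getcsv($utf8, ';', '"', '\\');
        }
        fclose($handle);

        return $rows;
    }
    ```

    It would be called as $rows = readCsvAsUtf8('path/to/file');. Reading line by line keeps the sketch short, but it would split quoted fields that contain line breaks.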

    What is important is that you know the original encoding (interpretation), and that when you validate the result you know exactly which encoding you are viewing it in.

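    For example, PHP sends a charset with every HTML response, taken from the default_charset setting. It has defaulted to UTF-8 since PHP 5.6; older versions left it empty, so browsers fell back to their own default, often iso-8859-1. That means it's quite possible for you to be converting correctly to iso-8859-2, but your browser will then "interpret" the bytes as a different encoding.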

    The best way to validate is to save the converted data to disk, then open the file in a text editor like VS Code that has already been set to your required encoding.
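
    If you want a check that does not depend on any viewer at all, you can also look at the raw bytes that were written. A small sketch, assuming ISO-8859-2 input and a temporary file created only for this example:

    ```php
    <?php
    // "č.p." as ISO-8859-2 bytes, converted to UTF-8.
    $converted = iconv('ISO-8859-2', 'UTF-8', "\xE8.p.");

    // Save the result to disk (a temporary file here), then read it back.
    $path = tempnam(sys_get_temp_dir(), 'enc');
    file_put_contents($path, $converted);

    // In UTF-8, č is the two bytes C4 8D, so this prints c48d2e702e.
    echo bin2hex(file_get_contents($path)), PHP_EOL;
    ```

    If the hex dump shows c3a8 where you expected c48d, the data was decoded as iso-8859-1 (è) somewhere along the way.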

    If you need further help, you will need to edit your question to include the exact code you're using.