python unicode polish

Encoding and decoding do not treat Polish letters the same way


From another source I get two names containing Polish letters (ń and ó), like below:

Of course, there are more than two such names.

The first should look like piaseczyński, while the second already looks correct. When I try to fix them with str(entity_name).encode('1252').decode('utf-8'), the first is fixed, but the second raises an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 8: invalid continuation byte

Why are the Polish letters not treated the same way? How can I fix it?


Solution

  • As you probably realise already, those strings have different encodings. The best approach is to fix it at the source, so that it always returns UTF-8 (or at least some consistent, known encoding).

    If you really can't do that, you should try to decode as UTF-8 first, because it's more strict: not every string of bytes is valid UTF-8. If you get UnicodeDecodeError, try to decode it as some other encoding:

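    def decode_crappy_bytes(b):
        """Decode bytes of unknown encoding: UTF-8 first, then Windows-1252."""
        try:
            # UTF-8 is strict, so a successful decode is good evidence it is right.
            return b.decode('utf-8')
        except UnicodeDecodeError:
            # Fallback guess; '1252' is Python's alias for cp1252.
            return b.decode('1252')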
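
The function above expects bytes, whereas the question starts from an already-decoded str. A minimal sketch of the same try-then-fall-back idea for that case follows, assuming the garbling came from UTF-8 bytes being read as Windows-1252. The name repair_mojibake is hypothetical, and 'kraków' is a made-up stand-in for a name that is already correct:

```python
def repair_mojibake(s: str) -> str:
    """Undo one round of 'UTF-8 bytes read as cp1252'; otherwise return s unchanged."""
    try:
        # Recover the original bytes, then decode them as the UTF-8 they were.
        return s.encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either s contains characters outside cp1252 (e.g. a real 'ń'),
        # or the recovered bytes are not valid UTF-8 (e.g. a real 'ó').
        # In both cases the string is assumed to be fine already.
        return s


# Simulate the broken input from the question.
mangled = 'piaseczyński'.encode('utf-8').decode('cp1252')
print(repair_mojibake(mangled))    # piaseczyński
print(repair_mojibake('kraków'))   # kraków (left untouched)
```

One caveat with this sketch: an uppercase Ł is C5 81 in UTF-8, and byte 0x81 is unassigned in Python's cp1252 codec, so a garbled Ł may not round-trip this way and could need separate handling.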
    

    Note that this can still fail, in two ways:

    1. If you get a string in some non-UTF-8 encoding that happens to be decodable as UTF-8 as well.
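    2. If you get a string in a non-UTF-8 encoding that's not Windows codepage 1252. Another common one in Europe is ISO-8859-1 (Latin-1). Almost every bytestring that's valid in one is also valid in the other (Latin-1 accepts any byte, and Windows-1252 leaves only a handful of bytes unassigned), so a failed decode will not tell you which of the two you have.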
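
Both failure modes can be reproduced in a few lines. This is only an illustration: the inputs are contrived, and Windows-1250 is used here on the assumption that it is a likely legacy encoding for Polish text.

```python
def decode_crappy_bytes(b):
    # Same logic as the function in the answer, repeated so this runs on its own.
    try:
        return b.decode('utf-8')
    except UnicodeDecodeError:
        return b.decode('1252')


# Case 1: Windows-1252 text whose bytes happen to form valid UTF-8.
ambiguous = 'Å„'.encode('cp1252')        # b'\xc5\x84'
print(decode_crappy_bytes(ambiguous))    # ń, although 'Å„' was intended

# Case 2: Windows-1250 bytes are rejected by UTF-8, but the fallback
# then assumes the wrong codepage and silently yields the wrong letter.
legacy = 'piaseczyński'.encode('cp1250')
print(decode_crappy_bytes(legacy))       # piaseczyñski (ñ instead of ń)
```

Neither case raises an exception, which is why such errors tend to go unnoticed until someone reads the output.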

    If you do need to deal with multiple different non-UTF-8 encodings and you know that it should be Polish, you could count the number of non-ASCII Polish letters in each possible decoding, and return the one with the highest score. Still not infallible, so really, it's best to fix it at the source.
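
The scoring idea could be sketched roughly as follows. The candidate list and the names POLISH_LETTERS, CANDIDATES and guess_decode are assumptions made for illustration, so adjust them to the encodings you actually expect to receive:

```python
POLISH_LETTERS = set('ąćęłńóśźżĄĆĘŁŃÓŚŹŻ')

# Hypothetical candidate list, in order of preference.
CANDIDATES = ('utf-8', 'cp1250', 'iso-8859-2', 'cp1252')


def guess_decode(b: bytes, candidates=CANDIDATES) -> str:
    """Decode b with the candidate encoding that yields the most Polish letters."""
    best_text, best_score = None, -1
    for enc in candidates:
        try:
            text = b.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding at all
        score = sum(ch in POLISH_LETTERS for ch in text)
        # Strict '>' keeps the earlier candidate on ties, so list order acts as priority.
        if score > best_score:
            best_text, best_score = text, score
    if best_text is None:
        raise ValueError('none of the candidate encodings could decode the input')
    return best_text


print(guess_decode('piaseczyński'.encode('utf-8')))    # piaseczyński
print(guess_decode('piaseczyński'.encode('cp1250')))   # piaseczyński
print(guess_decode('gąska'.encode('iso-8859-2')))      # gąska
```

Ties are common on short or mostly-ASCII inputs, in which case the result depends entirely on the order of the candidates, so treat the output as a best guess rather than a guarantee.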