From another source I get two names that contain Polish letters (ń and ó), like below:
Of course, there are more than two such names.
The first should look like piaseczyński, and the second already looks fine. But when I try to fix the first one using:
str(entity_name).encode('1252').decode('utf-8')
then the first is fixed, but the second raises an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 8: invalid continuation byte
Why are the Polish letters not treated the same way? How can I fix it?
As you probably realise already, those strings have different encodings. The best approach is to fix it at the source, so that it always returns UTF-8 (or at least some consistent, known encoding).
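To see the mismatch concretely, here is a small sketch. The second name (`król`) is a made-up sample, since the original second name isn't shown; the point is only that a string that was already decoded correctly cannot survive the cp1252-to-UTF-8 round trip, while a mojibake string can.

```python
# "piaseczyński" comes from the question; "król" is a hypothetical
# stand-in for the second name, which already looks correct.
name = "piaseczyński"

# UTF-8 bytes wrongly decoded as cp1252 give mojibake: ń (0xC5 0x84) becomes Å„
mojibake = name.encode("utf-8").decode("cp1252")
print(mojibake)  # piaseczyÅ„ski

# The round trip from the question undoes exactly that mistake.
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired == name)  # True

# A correctly decoded string is different: ó becomes the single byte 0xf3
# in cp1252, and that byte sequence is not valid UTF-8.
good = "król"
try:
    good.encode("cp1252").decode("utf-8")
    round_trip_failed = False
except UnicodeDecodeError as exc:
    round_trip_failed = True
    print(exc)
```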
If you really can't do that, try decoding as UTF-8 first, because it is stricter: not every sequence of bytes is valid UTF-8. If you get a UnicodeDecodeError, fall back to some other encoding:
    def decode_crappy_bytes(b):
        try:
            # UTF-8 is strict, so try it first
            return b.decode('utf-8')
        except UnicodeDecodeError:
            # fall back to Windows-1252
            return b.decode('1252')
Note that this can still fail, in two ways:
If you do need to deal with multiple different non-UTF-8 encodings and you know that it should be Polish, you could count the number of non-ASCII Polish letters in each possible decoding, and return the one with the highest score. Still not infallible, so really, it's best to fix it at the source.
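A minimal sketch of that scoring idea follows. The function names and the list of candidate encodings (`utf-8`, `cp1250`, `iso-8859-2`, `cp1252`) are assumptions chosen for illustration; adjust them to whatever your source might plausibly produce. On a tie the earlier encoding in the list wins, so order the candidates by how likely they are.

```python
# Non-ASCII Polish letters, lower and upper case.
POLISH_LETTERS = set("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ")


def polish_score(text):
    """Count how many characters in text are non-ASCII Polish letters."""
    return sum(ch in POLISH_LETTERS for ch in text)


def decode_polish_bytes(raw, encodings=("utf-8", "cp1250", "iso-8859-2", "cp1252")):
    """Decode raw with each candidate encoding and keep the best-scoring result.

    The candidate list is an assumption, not something the source guarantees.
    """
    best_text, best_score = None, -1
    for enc in encodings:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent these bytes at all
        score = polish_score(text)
        if score > best_score:  # strict '>' keeps the earlier encoding on ties
            best_text, best_score = text, score
    if best_text is None:
        raise ValueError("none of the candidate encodings could decode the input")
    return best_text


print(decode_polish_bytes("piaseczyński".encode("utf-8")))
print(decode_polish_bytes("śląsk".encode("iso-8859-2")))
```

Short strings with few Polish letters give the scorer little to work with, and several of these encodings agree on some letters (ó, for example, is 0xf3 in cp1250, ISO-8859-2 and cp1252), so ties are common.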