utf-8 decoding

UnicodeDecodeError: 'charmap' codec can't decode Serbian Latin alphabet letters like č, ć, ž, š, đ?


I have a CSV file that is read by a Python script, and the script throws an error whenever the file contains Serbian Latin alphabet letters. The decoder decodes these letters into nothing. Is there a way to tell it how these letters should be decoded, or to change its behavior, without going through all of the strings in the file?

This is the error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 1567: character maps to <undefined>

The only way I can see to do it is to replace all of these characters with their English Latin equivalents, but that is very time-consuming.


Solution

  • Your error message indicates that you are using a charmap codec. The Python docs have this to say about them:

    There’s another group of encodings (the so called charmap encodings) that choose a different subset of all Unicode code points and how these code points are mapped to the bytes 0x0–0xff. [...]

    All of these encodings can only encode 256 of the 1114112 code points defined in Unicode.

    (Emphasis added.)

    It is not particularly surprising, then, that you discovered some characters your codec cannot handle.

    It's not clear how your file is encoded. UTF-8 appears likely from the error message: 0x8d is the second byte of č in UTF-8 (0xC4 0x8D), and it has no mapping in, for example, cp1252, a charmap codec that Python often uses by default on Windows. There are several other possibilities, though. It's also not clear how your script ends up choosing a charmap codec for decoding the file, or which one it chooses. Whatever part of your code chooses the codec needs to select one appropriate for the file's actual encoding instead.

    Alternatively, it may be that the script is specific to a file format that does not support the characters you're asking about. If that's the case then the error is not in the script but in the data.
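
If the script opens the file with Python's built-in open(), selecting the codec usually means passing an explicit encoding argument. The following is only a minimal sketch, assuming the file really is UTF-8. The helper name read_rows and the file sample.csv are made up for illustration and do not come from your script. The demo writes a small UTF-8 file, reads it back, and shows that a charmap codec such as cp1252 fails on the same bytes.

```python
import csv
import os
import tempfile


def read_rows(path, encoding="utf-8"):
    """Read all CSV rows, decoding the file with an explicit encoding.

    Hypothetical helper; the UTF-8 default is an assumption about the file.
    """
    # newline="" is what the csv module documentation recommends for csv.reader.
    with open(path, newline="", encoding=encoding) as f:
        return list(csv.reader(f))


# Demo with a throwaway file containing Serbian Latin letters.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([["ime", "grad"], ["Đorđe", "Čačak"]])

    # Decoding with the file's actual encoding works.
    rows = read_rows(sample_path)

    # Decoding the same bytes with a charmap codec raises UnicodeDecodeError.
    try:
        read_rows(sample_path, encoding="cp1252")
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True
```

If UTF-8 turns out to be the wrong guess, the same argument accepts other codec names. For instance, cp1250 (Windows Central European) also covers č, ć, ž, š and đ, and utf-8-sig handles UTF-8 files that start with a byte order mark.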