
Encoding with UTF-16BE and decoding with UTF-8 prints the correct output, but the result cannot be parsed by the json module?


If I encode a string using UTF-16BE and decode the resulting bytes using UTF-8, I don't get any error, and the output appears to print correctly on the screen. Still, I'm unable to parse the decoded string into a Python object using the json module.

import json

str = '{"foo": "bar"}'  # note: this shadows the built-in str
encoded_str = str.encode("utf-16be")
decoded_str = encoded_str.decode("utf-8")
print(decoded_str)
print(json.JSONDecoder().decode(decoded_str))  # raises json.JSONDecodeError

I know that an encoded string should be decoded using the same encoding; what I'm trying to understand is why it behaves this way. I want to know:

  1. Why doesn't encoding str with utf-16be and then decoding encoded_str with utf-8 result in an error?

  2. Since encoding and decoding don't raise an error and decoded_str appears to be valid JSON (as can be seen from the print output), why does decode(decoded_str) result in an error?

  3. Why does writing the output to a file and viewing the file with the less command show it as a binary file?

    file = open("data.txt", 'w')
    file.write(decoded_str)
    file.close()
    

    When using less command to view the data.txt:

    "data.txt" may be a binary file.  See it anyway?
    
  4. If decoded_str is invalid JSON or something else, how can I view it in its original form? (print() is printing it as valid JSON.)

I'm using Python 3.10.12 on Ubuntu 22.04.4 LTS


Solution

    1. Why doesn't encoding str with utf-16be and then decoding encoded_str with utf-8 result in an error?

    Because in this case, the bytes produced by str.encode("utf-16be") also happen to be valid UTF-8. This is always the case with ASCII-only input; you need characters above U+007F to trigger an error here (e.g. use the string str = '{"foo": "！"}', which contains a full-width exclamation mark, U+FF01).
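The two cases can be checked directly; a small sketch:

```python
s = '{"foo": "bar"}'
encoded = s.encode("utf-16be")
# Every ASCII character becomes two bytes: 0x00 plus the ASCII byte.
print(encoded)  # b'\x00{\x00"\x00f\x00o\x00o\x00"\x00:\x00 \x00"\x00b\x00a\x00r\x00"\x00}'

# All of those bytes are below 0x80, so UTF-8 decodes each one as a
# single character: either NUL or the ASCII character itself.
decoded = encoded.decode("utf-8")

# A character above U+007F breaks this: U+FF01 is b'\xff\x01' in
# UTF-16BE, and the byte 0xff can never appear in valid UTF-8.
try:
    "！".encode("utf-16be").decode("utf-8")
except UnicodeDecodeError as e:
    print(e)
```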

    2. Since encoding and decoding don't raise an error and decoded_str appears to be valid JSON (as can be seen from the print output), why does decode(decoded_str) result in an error?

    Just because you can print a string does not make it valid JSON. Because of the encoding to UTF-16, a lot of null characters were introduced: f in UTF-16BE is the two bytes 0x00 0x66, and when those bytes are decoded as UTF-8 they become two characters, the null character U+0000 followed by f. JSON only allows space, tab, carriage return, and line feed as whitespace between tokens, and unescaped control characters are not allowed inside strings either, so those NULs make the text invalid JSON and decode(decoded_str) fails.
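The NULs are invisible in print() output on most terminals, but repr() makes them visible; a quick check:

```python
import json

decoded = '{"foo": "bar"}'.encode("utf-16be").decode("utf-8")
print(decoded)        # looks fine; the NULs are invisible on most terminals
print(repr(decoded))  # '\x00{\x00"\x00f\x00o\x00o\x00"\x00:\x00 \x00"\x00b\x00a\x00r\x00"\x00}'

try:
    json.loads(decoded)
except json.JSONDecodeError as e:
    print(e)  # the leading NUL is not a valid start of a JSON value
```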

    3. Why does writing the output to a file and viewing the file with the less command show it as a binary file?

    Probably those null bytes again. A file full of null bytes is relatively uncommon for UTF-8 text (and Linux tools strongly prefer UTF-8 over UTF-16), so less flags it as possibly binary.
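Reading the file back as raw bytes shows what less is reacting to; a sketch using a temporary file instead of data.txt:

```python
import os
import tempfile

decoded = '{"foo": "bar"}'.encode("utf-16be").decode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w") as f:
    f.write(decoded)

with open(path, "rb") as f:
    raw = f.read()
# The 0x00 bytes written to disk are what less flags as binary content.
print(raw[:8])  # b'\x00{\x00"\x00f\x00o'
```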

    4. If decoded_str is invalid JSON or something else, how can I view it in its original form? (print() is printing it as valid JSON.)

    There are too many possible answers here; it really depends on the actual use case. The quickest fix is simply not to encode and decode with different codecs. The next quickest is to reverse the encode/decode process, though this is not lossless for all strings or codec combinations, in particular the surrogate range when dealing with a UTF-16 + UTF-8 mix-up.
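For this ASCII-only string, reversing the mix-up does recover the original; a sketch:

```python
import json

decoded = '{"foo": "bar"}'.encode("utf-16be").decode("utf-8")

# Undo the mismatch: re-encode with the codec that was used to decode,
# then decode with the codec that was used to encode.
original = decoded.encode("utf-8").decode("utf-16be")
print(original)              # {"foo": "bar"}
print(json.loads(original))  # {'foo': 'bar'}
```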