I have a JSON file that contains \u-escaped Unicode characters, but when I read it in Python, the escaped characters are seemingly decoded as Latin-1 rather than UTF-8. Calling .encode('latin-1').decode('utf-8') on the affected strings seems to fix this, but why is it happening, and is there a way to tell json.load that the escape sequences should be read as UTF-8 rather than Latin-1?
JSON file message.json, which should contain a message composed of a "Grinning Face With Sweat" emoji:
{
"message": "\u00f0\u009f\u0098\u0085"
}
Python:
>>> import json
>>> with open('message.json') as infile:
... msg_json = json.load(infile)
...
>>> msg_json
{'message': 'ð\x9f\x98\x85'}
>>> msg_json['message']
'ð\x9f\x98\x85'
>>> msg_json['message'].encode('latin-1').decode('utf-8')
'😅'
Setting the encoding parameter in open or json.load doesn't seem to change anything, as the JSON file is plain ASCII and the Unicode is escaped within it.
What you have there is not the correct notation for the 😅 emoji; it really means "ð" (U+00F0) followed by three C1 control characters (U+009F, U+0098, U+0085), so the translation you get is correct! (The \u... notation refers to Unicode code points directly and is independent of the file's encoding.)
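The four escapes line up exactly with the four UTF-8 bytes of the emoji, which suggests whoever produced the file escaped each UTF-8 byte as if it were a code point. A quick demonstration of that correspondence (just illustration, not part of the fix):
>>> '😅'.encode('utf-8')
b'\xf0\x9f\x98\x85'
>>> b'\xf0\x9f\x98\x85'.decode('latin-1')
'ð\x9f\x98\x85'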
The proper notation for 😅 (code point U+1F605) in JSON, as in JavaScript, is the surrogate pair \ud83d\ude05. Use that in the JSON:
{
"message": "\ud83d\ude05"
}
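With that notation, json.load decodes the surrogate pair straight to the emoji, no post-processing needed:
>>> import json
>>> json.loads('{"message": "\\ud83d\\ude05"}')
{'message': '😅'}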
If, on the other hand, your question is how you can get the correct results from the wrong data, then yes, as the comments say, you may have to jump through some hoops to do that; one possible approach is sketched below.
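A minimal sketch of such a repair, assuming every affected value is a string that went through the byte-as-code-point mangling. The helper names fix_mojibake and fix_object are hypothetical, not a standard API, and object_hook only sees strings that appear directly inside JSON objects; values nested in arrays would need similar handling.

import json

def fix_mojibake(s):
    # Hypothetical helper: the code points in s are really UTF-8
    # bytes in disguise, so round-trip them back through Latin-1.
    try:
        return s.encode('latin-1').decode('utf-8')
    except UnicodeError:
        return s  # leave strings that don't fit the pattern alone

def fix_object(obj):
    # json.load calls this hook with every decoded JSON object (dict).
    return {k: fix_mojibake(v) if isinstance(v, str) else v
            for k, v in obj.items()}

with open('message.json') as infile:
    msg_json = json.load(infile, object_hook=fix_object)

print(msg_json['message'])  # 😅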