python json unicode encoding instagram

Unable to parse non-ASCII characters from exported Instagram chat log


I requested a data download from Instagram and I chose the JSON format. However, when I got the file and unzipped it, every non-ASCII character was represented as a Unicode escape sequence. E.g.:

"sender_name": "Le\u00c3\u00b3 Tak\u00c3\u00a1cs"

The correct text would be: "sender_name": "Leó Takács"

I tried parsing the JSON file with Python and correcting the errors somehow, but instead of getting "ó" for "\u00c3\u00b3", I got ó. Whatever I tried, each escape sequence was decoded as a separate character. The same thing happened with emojis too, so hardcoding a replacement for every problematic character would be a bit of a headache. I would prefer a solution that can be done programmatically, but at this point any idea, including third-party software, is welcome.


Solution

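  • It would appear that each UTF-8 byte was written out as a separate Unicode character, or in other words the UTF-8 bytes were interpreted as Latin-1 encoded text. Encoding the string back to Latin-1 recovers the original bytes, which can then be decoded as UTF-8: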

    data = '"sender_name": "Le\u00c3\u00b3 Tak\u00c3\u00a1cs"'
    cleaned = data.encode('latin-1').decode('utf-8')
    print(cleaned)
    # "sender_name": "Leó Takács"
    

    i.e. the text "Le\u00c3\u00b3 Tak\u00c3\u00a1cs" should have been the UTF-8 bytes b'"Le\xc3\xb3 Tak\xc3\xa1cs"'.
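To apply the same round trip to a whole export rather than a single string, something like the following might work. It is only a sketch: `fix_mojibake` is a hypothetical helper name, the `raw` string stands in for the contents of the exported file, and the `content` key is assumed for illustration. It also assumes every string in the file was mangled the same way; strings that cannot make the Latin-1/UTF-8 round trip are left untouched.

```python
import json

def fix_mojibake(value):
    """Recursively re-decode strings whose UTF-8 bytes were read as Latin-1.

    Hypothetical helper: walks dicts and lists as produced by json.loads.
    """
    if isinstance(value, str):
        try:
            return value.encode('latin-1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not mangled in the expected way; keep the string as it is.
            return value
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    return value  # numbers, booleans, None

# Stand-in for the file contents; the "content" key is an assumption.
raw = ('{"sender_name": "Le\\u00c3\\u00b3 Tak\\u00c3\\u00a1cs", '
       '"content": "\\u00f0\\u009f\\u0098\\u0080"}')

messages = fix_mojibake(json.loads(raw))
print(messages["sender_name"])
```

For a real export, `json.load` on the opened file would replace `json.loads(raw)`. Because the fix works on raw bytes, multi-byte sequences such as emojis are handled the same way as accented letters.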