I requested a data download from Instagram and I chose the JSON format. However, when I got the file and unzipped it, every non-ASCII character was represented as a Unicode escape sequence. E.g.:
"sender_name": "Le\u00c3\u00b3 Tak\u00c3\u00a1cs"
The correct text would be: "sender_name": "Leó Takács"
I tried parsing the JSON file with Python and correcting the errors somehow, but instead of getting "ó" for "\u00c3\u00b3", I got "Ã³". Every approach I tried returned the characters decoded individually. The same thing happened with emojis too, so hardcoding every problematic character to be replaced would be a bit of a headache. I would prefer a solution that works programmatically, but at this point any idea, including 3rd-party software, is welcome.
It would appear that each UTF-8 byte is being interpreted as a separate Unicode character; in other words, UTF-8 text is being decoded as Latin-1. The fix is to reverse that: re-encode the mis-decoded string as Latin-1 (recovering the original UTF-8 bytes), then decode those bytes as UTF-8.
# Mis-decoded text: each UTF-8 byte became one Latin-1 character
data = '"sender_name": "Le\u00c3\u00b3 Tak\u00c3\u00a1cs"'
# Encode back to the raw bytes, then decode them properly as UTF-8
cleaned = data.encode('latin-1').decode('utf-8')
print(cleaned)
# "sender_name": "Leó Takács"
That is, "Le\u00c3\u00b3 Tak\u00c3\u00a1cs" should have been b'"Le\xc3\xb3 Tak\xc3\xa1cs"'.