I downloaded my Facebook Messenger data (in your Facebook account, go to Settings, then Your Facebook Information, then Download Your Information, and create a file with at least the Messages box checked) to do some cool statistics.
However, there is a small problem with the encoding. I'm not sure, but it looks like Facebook used bad encoding for this data. When I open the file with a text editor I see something like this: Rados\u00c5\u0082aw. When I try to open it with Python (UTF-8) I get RadosÅ\x82aw. However, I should get: Radosław.
My Python script:

import json
import os

text = open(os.path.join(subdir, file), encoding='utf-8')
conversations.append(json.load(text))
I tried a few of the most common encodings. Example data is:
{
  "sender_name": "Rados\u00c5\u0082aw",
  "timestamp": 1524558089,
  "content": "No to trzeba ostatnie treningi zrobi\u00c4\u0087 xD",
  "type": "Generic"
}
I can indeed confirm that the Facebook download data is incorrectly encoded; it is a case of Mojibake. The original data is UTF-8 encoded but was decoded as Latin-1 instead. I'll make sure to file a bug report.
What this means is that any non-ASCII character in the string data was encoded twice: first to UTF-8, and then the UTF-8 bytes were encoded again by interpreting them as Latin-1 data (Latin-1 maps exactly 256 characters to the 256 possible byte values) and writing them out with the \uHHHH JSON escape notation (a literal backslash, a literal lowercase letter u, followed by 4 hex digits, 0-9 and a-f). Because this second step encoded byte values in the range 0-255, it resulted in a series of \u00HH sequences (a literal backslash, a literal lowercase letter u, two 0 zero digits and two hex digits).
E.g. the Unicode character U+0142 LATIN SMALL LETTER L WITH STROKE in the name Radosław was encoded to the UTF-8 byte values C5 and 82 (in hex notation), and then encoded again to \u00c5\u0082.
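To make the chain concrete, here is a minimal sketch reproducing the mangling in Python (just an illustration of the effect, not code Facebook actually runs):

import json

original = 'ł'                           # U+0142
utf8_bytes = original.encode('utf-8')    # b'\xc5\x82'
mangled = utf8_bytes.decode('latin-1')   # 'Å\x82' -- each byte becomes a separate character
print(json.dumps(mangled))               # prints "\u00c5\u0082"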
You can repair the damage in two ways:
Decode the data as JSON, then re-encode any string values as Latin-1 binary data, and then decode again as UTF-8:
>>> import json
>>> data = r'"Rados\u00c5\u0082aw"'
>>> json.loads(data).encode('latin1').decode('utf8')
'Radosław'
This would require a full traversal of your data structure to find all those strings, of course.
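Such a traversal could look roughly like this; a sketch only, assuming the document contains nothing but dicts, lists, strings and plain scalars, and that every string survived the round trip through Latin-1 (the repair() helper name is mine):

def repair(obj):
    if isinstance(obj, str):
        return obj.encode('latin-1').decode('utf-8')
    if isinstance(obj, list):
        return [repair(item) for item in obj]
    if isinstance(obj, dict):
        return {repair(key): repair(value) for key, value in obj.items()}
    return obj  # numbers, booleans and None are left as they are

conversations = [repair(conversation) for conversation in conversations]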
Alternatively, load the whole JSON document as binary data, replace all \u00hh JSON sequences with the byte the last two hex digits represent, then decode as JSON:
import json
import os
import re
from functools import partial

# Replace each literal \u00hh escape with the single byte its hex digits encode
fix_mojibake_escapes = partial(
    re.compile(rb'\\u00([\da-f]{2})').sub,
    lambda m: bytes.fromhex(m[1].decode()),
)

with open(os.path.join(subdir, file), 'rb') as binary_data:
    repaired = fix_mojibake_escapes(binary_data.read())

data = json.loads(repaired)
(If you are using Python 3.5 or older, you'll have to decode the repaired bytes object from UTF-8 first, so use json.loads(repaired.decode()).)
From your sample data this produces:
{'content': 'No to trzeba ostatnie treningi zrobić xD',
'sender_name': 'Radosław',
'timestamp': 1524558089,
'type': 'Generic'}
The regular expression matches all \u00HH sequences in the binary data and replaces them with the bytes they represent, so that the data can be decoded correctly as UTF-8. The second decoding is taken care of by the json.loads() function when given binary data.
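As a quick sanity check, the repair function can also be applied directly to an in-memory snippet (the sample bytes below are just the question's record pasted inline and trimmed):

sample = rb'{"sender_name": "Rados\u00c5\u0082aw", "timestamp": 1524558089}'
print(json.loads(fix_mojibake_escapes(sample)))
# {'sender_name': 'Radosław', 'timestamp': 1524558089}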