pythonencodingutf-8mojibake

In what world would \\u00c3\\u00a9 become é?


I have a likely improperly encoded json document from a source I do not control, which contains the following strings:

d\u00c3\u00a9cor

business\u00e2\u20ac\u2122 active accounts 

the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label

From this, I am gathering they intend for \u00c3\u00a9 to beceom é, which would be utf-8 hex C3 A9. That makes some sense. For the others, I assume we are dealing with some types of directional quotation marks.

My theory here is that this is either using some encoding I've never encountered before, or that it has been double-encoded in some way. I am fine writing some code to transform their broken input into something I can understand, as it is highly unlikely they would be able to fix the system if I brought it to their attention.

Any ideas how to force their input to something I can understand? For the record, I am working in Python.


Solution

  • You should try the ftfy module:

    >>> print ftfy.ftfy(u"d\u00c3\u00a9cor")
    décor
    >>> print ftfy.ftfy(u"business\u00e2\u20ac\u2122 active accounts")
    business' active accounts
    >>> print ftfy.ftfy(u"the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label")
    the "Made in the USA" label
    >>> print ftfy.ftfy(u"the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label", uncurl_quotes=False)
    the “Made in the USA” label