Python 2.7 - Unable to correctly decode email subject-header line


I'm using Python 2.7, and I am trying to properly decode the subject header line of an email. The source of the email is:

Subject: =?UTF-8?B?VGkgw6ggcGlhY2l1dGEgbGEgZGVtbz8gU2NvcHJpIGFsdHJlIG4=?=

I use the function decode_header(header) from the email.header module, and the result is:

[('Ti \xc3\xa8 piaciuta la demo? Scopri altre n', 'utf-8')]

The '\xc3\xa8' part should be the 'è' character, but it is not decoded/displayed correctly. Another example:

Subject: =?iso-8859-1?Q?niccol=F2_cop?= =?iso-8859-1?Q?ernico?=

Result:

[('niccol\xf2 copernico', 'iso-8859-1')]

How can I obtain the correct string?


Solution

  • You are getting the correct string. It's just still encoded as a byte string (UTF-8 in the first case, ISO-8859-1 in the second); you need to decode it with the charset that decode_header returns to get the actual unicode string.

    For example:

    >>> print unicode('Ti \xc3\xa8 piaciuta la demo? Scopri altre n', 'utf-8')
    Ti è piaciuta la demo? Scopri altre n
    

    Or:

    >>> print unicode('niccol\xf2 copernico', 'iso-8859-1')
    niccolò copernico
    

    That's why you get back both the header data and the encoding.
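
Putting the two steps together, here is a minimal sketch that decodes every (data, charset) pair returned by decode_header and joins the results. The helper name decode_subject and the default_charset parameter are made up for illustration; falling back to ASCII when no charset is given, replacing undecodable bytes, and joining the parts without a separator are assumptions that suit headers like the two above, not behaviour taken from the email library.

```python
from email.header import decode_header


def decode_subject(raw_header, default_charset='ascii'):
    """Return the header value as a single unicode string.

    decode_subject is a hypothetical helper, not part of the email package.
    """
    parts = []
    for data, charset in decode_header(raw_header):
        # On Python 2.7 every part is a byte string (str). A part without
        # a charset was not MIME-encoded, so the fallback is used for it.
        if isinstance(data, bytes):
            # 'replace' is an assumption: it swaps undecodable bytes for
            # U+FFFD instead of raising UnicodeDecodeError.
            data = data.decode(charset or default_charset, 'replace')
        parts.append(data)
    # Assumption: plain concatenation is good enough here; headers that mix
    # encoded and unencoded words may need different spacing.
    return u''.join(parts)


subject_1 = decode_subject(
    '=?UTF-8?B?VGkgw6ggcGlhY2l1dGEgbGEgZGVtbz8gU2NvcHJpIGFsdHJlIG4=?=')
subject_2 = decode_subject(
    '=?iso-8859-1?Q?niccol=F2_cop?= =?iso-8859-1?Q?ernico?=')
```

Printing subject_1 and subject_2 should then show the accented characters, provided the terminal encoding can represent them. A charset name that Python does not recognise would still raise LookupError in this sketch.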