Gmail API: decoding messages from everywhere


I am working with the Gmail API in Python to retrieve mails written in French, and I am having a problem with accented characters.

I retrieve the messages like this:

 message = service.users().messages().get(userId="me", id=i, format="raw").execute()

All I want is to get the body of the mail, so I start with this:

base64.urlsafe_b64decode(message['raw'].encode('ASCII'))

For some mails this works: I get all the mail data, including French text such as:

"Cette semaine, vous vous êtes servis du module de révision 0 fois"

For others, I get quoted-printable encoding, like this:

"Salut, =E7a farte?"

Quoted-printable encoding itself is not an issue, as I have built a simple decoding function using the quopri module. The main problem is that the sentence above looks wrongly encoded: the encoded character is ç, and I would expect it to be encoded like this:

"Salut, =C3=A7a farte?"

So with the wrongly encoded sentence, I end up with this kind of output:

Salut, �a farte?

I suspect the cause is the different mail clients: my first example is a message sent from the Gmail client to an Outlook address, and the second is the opposite, an Outlook message sent to a Gmail address.

My question is: is there a way to handle decoding for any possible scenario?


Solution

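  • The problem is that while quopri correctly translates the mail body from 7-bit data to 8-bit data, the encoding that you then use to convert this bytestring into a unicode string is not the right one. The sequence =E7 stands for the single byte 0xE7, which is ç in ISO-8859-1, whereas UTF-8 encodes ç as the two bytes 0xC3 0xA7. So in your example, the encoding appears to be ISO-8859-1: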

    In [1]: import quopri
    
    In [2]: quopri.decodestring('Salut, =E7a farte?').decode('iso-8859-1')
    Out[2]: 'Salut, ça farte?'
    

    Usually you should be able to get the correct encoding from the charset parameter of the Content-Type header. This is what it looks like in a mail that uses quoted-printable UTF-8 encoding:

    Content-Type: text/plain;charset=UTF-8
    Content-Transfer-Encoding: quoted-printable
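
One way to avoid reading these headers by hand is to let Python's standard email package do it. The following is a minimal sketch, assuming Python 3.6 or later and that the input is the `message['raw']` string from the question; `extract_body` is a hypothetical helper name, not part of the Gmail API or of the original answer.

```python
import base64
import email
from email import policy


def extract_body(raw_b64url: str) -> str:
    """Decode a Gmail 'raw' payload and return its text body as a str.

    Hypothetical helper for illustration only.
    """
    # Gmail returns the whole RFC 2822 message as URL-safe base64.
    raw_bytes = base64.urlsafe_b64decode(raw_b64url.encode("ascii"))

    # policy.default makes the parser return EmailMessage objects, which
    # can undo the Content-Transfer-Encoding (quoted-printable, base64)
    # and decode the bytes with the charset declared in Content-Type.
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)

    # Prefer the plain-text part and fall back to HTML if there is none.
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""
    return part.get_content()
```

It would be called as `extract_body(message['raw'])`. This relies on the sender declaring the correct charset; a mail that declares a wrong or unknown charset would still need its own fallback handling.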