I am working with the Gmail API in Python to retrieve mails written in French, and I'm having a problem with accents.
I retrieve the messages with this:
message = service.users().messages().get(userId="me", id=i, format="raw").execute()
All I want is to get the body of the mail, so I start with this:
base64.urlsafe_b64decode(message['raw'].encode('ASCII'))
For some mails it works, and I retrieve all the mail data, including French text like:
"Cette semaine, vous vous êtes servis du module de révision 0 fois"
For some others, I get quoted-printable encoding, like this:
"Salut, =E7a farte?"
Quoted-printable encoding is no issue, as I have built a simple decoding function using the quopri module. The main problem is that the last sentence looks wrong for quoted-printable encoding: the encoded character is ç, and it should be encoded like this:
"Salut, =C3=A7a farte?"
So with the wrongly encoded sentence, I end up with this kind of output:
Salut, �a farte?
I suspect the cause is the different mail clients: my first example is a message sent from the Gmail client to an Outlook address, and the second is the opposite, an Outlook message sent to a Gmail address.
My question is: is there a way to handle decoding for any possible scenario?
The problem is that while quopri correctly translates the mail body from 7-bit data to 8-bit data, the encoding that you then use to convert this bytestring into a Unicode string is not the right one. In your example, it appears to be ISO-8859-1:
In [1]: import quopri
In [2]: quopri.decodestring('Salut, =E7a farte?').decode('iso-8859-1')
Out[2]: 'Salut, ça farte?'
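To see where the replacement character in the question comes from, here is a small sketch that decodes the same bytes twice. It assumes the failing code decoded the result as UTF-8 with `errors="replace"` (or printed it through something equivalent); the variable names are only illustrative.

```python
import quopri

# quopri only undoes the transfer encoding; the result is still raw bytes.
decoded_bytes = quopri.decodestring(b"Salut, =E7a farte?")

# 0xE7 on its own is not a valid UTF-8 sequence, so it becomes U+FFFD.
as_utf8 = decoded_bytes.decode("utf-8", errors="replace")

# In ISO-8859-1, 0xE7 is the single-byte code for "ç".
as_latin1 = decoded_bytes.decode("iso-8859-1")

print(as_utf8)
print(as_latin1)
```

So `=E7` is not a wrong encoding of ç. It is the quoted-printable form of the ISO-8859-1 byte, whereas `=C3=A7` is the quoted-printable form of the two UTF-8 bytes.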
Usually you should be able to get the correct encoding from the Content-Type header. This is what it looks like in a mail that uses quoted-printable UTF-8 encoding:
Content-Type: text/plain;charset=UTF-8
Content-Transfer-Encoding: quoted-printable
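Rather than reading these headers by hand, one option is to let the standard library's email package do it. The following is a minimal sketch, not a complete solution: `extract_text_body` is a hypothetical helper name, it only looks for a text/plain part, and it assumes that falling back to UTF-8 is acceptable when no charset is declared.

```python
import base64
import email
from email import policy


def extract_text_body(raw_b64url: str) -> str:
    """Return the text/plain body of a Gmail 'raw' message as a str."""
    # Gmail's format="raw" payload is the full RFC 2822 message, base64url-encoded.
    raw_bytes = base64.urlsafe_b64decode(raw_b64url.encode("ascii"))
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)

    # Pick the plain-text part (this is the message itself if it is not multipart).
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return ""

    # decode=True undoes Content-Transfer-Encoding (quoted-printable or base64).
    payload = part.get_payload(decode=True)
    # The charset comes from the Content-Type header; UTF-8 is an assumed fallback.
    charset = part.get_content_charset() or "utf-8"
    return payload.decode(charset, errors="replace")
```

With the message from the question, this would be called as `extract_text_body(message['raw'])`. A mail that declares an unknown or incorrect charset would still need extra handling.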