[SOLVED] Wrong encoding when displaying an HTML Request in Python

I do not understand why, when I make an HTTP request using the Requests library and then print r.text, special characters (such as accented letters) come out encoded (é appears as &eacute;, for example).

Yet when I try r.encoding, I get utf-8.

In addition, the problem occurs only on some websites. Sometimes I have the correct characters, but other times, not at all.

Try as follows:

import requests

r = requests.get("https://gks.gs/login")
print(r.text)

The encoded characters are displayed there; for example, we can see Mot de passe oubli&eacute; ?.

I do not understand why. Do you think it may be because of https? How can I fix this, please?

Solution

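These are HTML character entity references. They are part of the page source itself, so this is not an encoding problem and has nothing to do with https; it also explains why only some websites are affected. The easiest way to decode them is: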

In Python 2.x:

>>> import HTMLParser
>>> print HTMLParser.HTMLParser().unescape('oubli&eacute;')
oublié

In Python 3.x (3.4 and later; HTMLParser().unescape() was deprecated in 3.4 and removed in 3.9, so use html.unescape() instead):

>>> import html
>>> html.unescape('oubli&eacute;')
'oublié'
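
To tie this back to the question, here is a minimal sketch for Python 3.4+ that decodes a whole string the way you would decode r.text. The name decode_entities is a hypothetical helper, and the sample string only stands in for the real response body so that the example runs without network access.

```python
import html


def decode_entities(text):
    """Replace HTML character references (named or numeric) with their characters."""
    return html.unescape(text)


# Stand-in for r.text from the question (assumption: no real request is made here).
sample = "Mot de passe oubli&eacute; ?"
print(decode_entities(sample))
```

With a real response you would call decode_entities(r.text). Note that this decodes every entity in the string, including &lt; and &amp;, so it is best applied to text you have already extracted from the markup rather than to HTML you still intend to parse.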