I do not understand why, when I make an HTTP request with the Requests library and then print r.text, special characters (such as accented letters) come back encoded (&eacute; instead of é, for example). Yet when I check r.encoding, I get utf-8.
In addition, the problem occurs only on some websites. Sometimes I get the correct characters, and other times I do not.
To reproduce, try the following:
import requests
r = requests.get("https://gks.gs/login")
print(r.text)
The output contains encoded characters; for example, we can see Mot de passe oubli&eacute; ?.
Could it be because of HTTPS? How can I fix this?
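These are HTML character entity references. They are written that way in the page's own markup, so this is not a character-encoding problem and HTTPS is not involved; Requests simply returns what the server sent. The easiest way to decode them is: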
In Python 2.x:
>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('oubli&eacute;')
u'oubli\xe9'
In Python 3.x (3.4 and later; HTMLParser.unescape was deprecated in 3.4 and removed in 3.9):
>>> import html
>>> html.unescape('oubli&eacute;')
'oublié'
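Applied to a response body, the same idea might look like the sketch below. It assumes Python 3.4 or later; the sample string stands in for r.text so that the snippet runs without a network call, and the helper name decode_entities is only illustrative.

```python
import html

# Stand-in for r.text (assumption: in real use this would be
# requests.get(url).text; no request is made here).
page_text = "<a href='#'>Mot de passe oubli&eacute; ?</a>"


def decode_entities(text):
    """Replace named and numeric HTML character references with their characters."""
    return html.unescape(text)


decoded = decode_entities(page_text)
print(decoded)
```

Note that unescaping a whole page also turns &lt; and &amp; into literal < and &, which can change the markup. If the goal is to extract visible text, it is usually safer to run the page through an HTML parser first and unescape (or read) only the text nodes.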