python, python-requests, content-encoding

Wrong encoding when displaying an HTML Request in Python


I do not understand why, when I make an HTTP request using the Requests library and then display r.text, special characters (such as accented letters) come back encoded (é = &eacute; for example).

Yet when I try r.encoding, I get utf-8.

In addition, the problem occurs only on some websites. Sometimes I have the correct characters, but other times, not at all.

Try the following:

import requests

r = requests.get("https://gks.gs/login")
print(r.text)

The output contains encoded characters; for example, we can see Mot de passe oubli&eacute; ?.

I do not understand why. Could it be because of HTTPS? How can I fix this?


Solution

  • These are HTML character entity references. The easiest way to decode them is:

    In Python 2.x:

    >>> import HTMLParser
    >>> HTMLParser.HTMLParser().unescape('oubli&eacute;')
    u'oubli\xe9'
    

    In Python 3.x (3.4 and later; HTMLParser.unescape() was removed in Python 3.9):

    >>> import html
    >>> html.unescape('oubli&eacute;')
    'oublié'