python web-scraping python-requests character-encoding hindi

Unable to read Hindi/Devanagari with Python requests / urllib modules


I'm trying to scrape the NREGA website, which contains data in Hindi (Devanagari script). The page structure is easy to scrape, but when I fetch the HTML with requests or urllib, the Hindi text turns into gibberish. The text displays correctly when I view the page source in the browser.

content = requests.get(URL).text

' 1 पी एस ' on the site ends up as ' 1 \xe0\xa4\xaa\xe0\xa5\x80 \xe0\xa4\x8f\xe0\xa4\xb8 ' in content, and shows up as gibberish when I export it to a CSV file.


Solution

  • The response from the server doesn't specify a charset in its Content-Type header, so requests falls back to ISO-8859-1 (latin-1) when decoding the page text.

    >>> r = requests.get('https://mnregaweb4.nic.in/netnrega/writereaddata/citizen_out/funddisreport_2701004_eng_1314_.html')
    >>> r.encoding
    'ISO-8859-1'
    

    In fact, the page is encoded as UTF-8, as we can tell by inspecting the response's apparent_encoding attribute, which guesses the encoding from the bytes of the body:

    >>> r.apparent_encoding
    'utf-8'
    

    or by experiment:

    >>> s = '1 \xe0\xa4\xaa\xe0\xa5\x80 \xe0\xa4\x8f\xe0\xa4\xb8'
    >>> s.encode('latin').decode('utf-8')
    '1 पी एस'
    

    The correct output can be obtained by decoding the response's content attribute:

    >>> html = r.content.decode(r.apparent_encoding)
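
The steps above can be reproduced without a network call. This is a minimal sketch, not the asker's actual scraper: the sample string is the one from the question, the byte string `raw` stands in for `r.content`, and the CSV column label and the use of `utf-8-sig` (which writes a byte order mark so that spreadsheet programs such as Excel are more likely to detect UTF-8) are assumptions about how the export is done.

```python
import csv
import io

# Stand-in for r.content: the raw bytes the server sends (UTF-8 encoded).
raw = ' 1 पी एस '.encode('utf-8')

# What happens when the bytes are decoded as ISO-8859-1:
# every byte becomes one character, which produces the gibberish.
wrong = raw.decode('iso-8859-1')

# Decoding the same bytes as UTF-8 recovers the Devanagari text.
right = raw.decode('utf-8')

# Write the text to CSV with an explicit encoding. An in-memory buffer is
# used here; with a real file, open(path, 'w', encoding='utf-8-sig', newline='')
# plays the same role.
buffer = io.BytesIO()
wrapper = io.TextIOWrapper(buffer, encoding='utf-8-sig', newline='')
csv.writer(wrapper).writerow(['name', right.strip()])  # 'name' is a hypothetical label
wrapper.flush()
csv_bytes = buffer.getvalue()
```

With a live response, setting `r.encoding = r.apparent_encoding` before reading `r.text` should give the same result as decoding `r.content` by hand.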