[SOLVED] Unable to read Hindi/Devanagari with Python requests / urllib modules

I'm trying to scrape this NREGA website, which contains data in Hindi (Devanagari script). The structure is easy to scrape, but when I use requests/urllib to fetch the HTML, the Hindi text turns into gibberish. The text displays fine when I view the page source in the browser, though.

content = requests.get(URL).text

' 1 पी एस ' on the site ends up in content as ' 1 \xe0\xa4\xaa\xe0\xa5\x80 \xe0\xa4\x8f\xe0\xa4\xb8 ', and it shows up as gibberish when I export it to a CSV.

Solution

The response from the server doesn't specify a charset in its Content-Type header, so requests falls back to ISO-8859-1 (latin-1), its default for text responses that don't declare a charset.

>>> r = requests.get('https://mnregaweb4.nic.in/netnrega/writereaddata/citizen_out/funddisreport_2701004_eng_1314_.html')
>>> r.encoding
'ISO-8859-1'

In fact, the page is encoded as UTF-8. The response's apparent_encoding attribute, which guesses the encoding from the bytes of the body rather than from the headers, suggests as much:

>>> r.apparent_encoding
'utf-8'

or by experiment:

>>> s = '1 \xe0\xa4\xaa\xe0\xa5\x80 \xe0\xa4\x8f\xe0\xa4\xb8'
>>> s.encode('latin').decode('utf-8')
'1 पी एस'
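
To see why that round trip works, here is a minimal sketch that reproduces the problem without touching the network. The helper name fix_mojibake is made up for illustration, and it assumes the text was UTF-8 bytes wrongly decoded as latin-1, which is the situation described above.

```python
def fix_mojibake(text: str) -> str:
    """Undo a latin-1 mis-decoding of UTF-8 bytes (hypothetical helper).

    latin-1 maps every byte 0-255 to the code point with the same number,
    so encoding back to latin-1 recovers the original bytes exactly.
    """
    return text.encode("latin-1").decode("utf-8")


original = "1 पी एस"

# Simulate what happens when UTF-8 bytes are decoded with the wrong codec.
garbled = original.encode("utf-8").decode("latin-1")

# Reverse the mistake: back to bytes via latin-1, then decode as UTF-8.
repaired = fix_mojibake(garbled)
```

This only repairs text that was garbled in exactly this way. If the string contains characters outside latin-1, or the bytes are not valid UTF-8, the encode or decode step raises an exception instead of silently producing wrong output.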

To get the correct text, decode the raw bytes in the response's content attribute with that encoding:

>>> html = r.content.decode(r.apparent_encoding)
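
Since the question mentions exporting to a CSV, here is a rough sketch of that last step. It uses a hard-coded byte string as a stand-in for r.content so it runs offline; the function name write_rows_utf8 and the file name nrega_sample.csv are invented for the example. It assumes the CSV will be opened in a spreadsheet program, which is why it writes with the utf-8-sig codec (UTF-8 with a byte order mark); plain utf-8 should work just as well for other consumers.

```python
import csv
import os
import tempfile


def write_rows_utf8(path, rows):
    """Write rows to a CSV file as UTF-8 with a BOM (hypothetical helper)."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f).writerows(rows)


# Stand-in for r.content: the raw UTF-8 bytes sent by the server.
raw = "1 पी एस".encode("utf-8")

# Decode the bytes explicitly instead of relying on the guessed default.
cell = raw.decode("utf-8")

csv_path = os.path.join(tempfile.mkdtemp(), "nrega_sample.csv")
write_rows_utf8(csv_path, [["name"], [cell]])

# Read the file back to check that the Devanagari text survived.
with open(csv_path, encoding="utf-8-sig", newline="") as f:
    rows_read = list(csv.reader(f))
```

With requests itself, setting r.encoding = r.apparent_encoding before reading r.text should give the same decoded string as the explicit r.content.decode(...) call.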