I'm trying to scrape this NREGA website, which contains data in Hindi, i.e. in Devanagari script. The structure is easy to scrape, but when I use requests/urllib to fetch the HTML, the Hindi text is converted to gibberish. The text is displayed fine in the page source of the site, though.
content = requests.get(URL).text
' 1 पी एस ' on the site ends up in content as ' 1 \xe0\xa4\xaa\xe0\xa5\x80 \xe0\xa4\x8f\xe0\xa4\xb8 ' and is displayed as gibberish when I try to export it to a CSV.
The response from the server doesn't specify a charset in its Content-Type header, so requests assumes that the page is encoded as ISO-8859-1 (latin-1).
>>> r = requests.get('https://mnregaweb4.nic.in/netnrega/writereaddata/citizen_out/funddisreport_2701004_eng_1314_.html')
>>> r.encoding
'ISO-8859-1'
In fact, the page is encoded as UTF-8, as we can tell by inspecting the response's apparent_encoding attribute:
>>> r.apparent_encoding
'utf-8'
or by experiment:
>>> s = '1 \xe0\xa4\xaa\xe0\xa5\x80 \xe0\xa4\x8f\xe0\xa4\xb8'
>>> s.encode('latin').decode('utf-8')
'1 पी एस'
The correct output can be obtained by decoding the response's content attribute:
>>> html = r.content.decode(r.apparent_encoding)
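Since the goal is a CSV export, here is a minimal offline sketch of the whole round trip. It assumes a sample byte string standing in for r.content, and the file name and the "name" column label are made up for illustration. Writing with the "utf-8-sig" codec is one option that often helps spreadsheet programs detect the encoding; plain "utf-8" works too if the consumer already expects it.

```python
import csv
import os
import tempfile

# Stand-in for r.content: the UTF-8 bytes the server sends (assumed sample).
raw = "1 पी एस".encode("utf-8")

# Decoding with the wrong charset reproduces the gibberish from the question.
wrong = raw.decode("iso-8859-1")

# Decoding as UTF-8 recovers the Devanagari text.
right = raw.decode("utf-8")

# Hypothetical output file; "utf-8-sig" prepends a BOM to the UTF-8 data.
path = os.path.join(tempfile.mkdtemp(), "nrega_sample.csv")
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerow(["name", right])

# Read the file back with the same codec to check the round trip.
with open(path, encoding="utf-8-sig", newline="") as f:
    rows = list(csv.reader(f))
```

With a live response, setting r.encoding = r.apparent_encoding (or simply 'utf-8') before reading r.text should give the same decoded string as the r.content.decode(...) call above.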