The requests module reports a different encoding than the one actually set in the HTML page
Code:
import requests
URL = "http://www.reynamining.com/nuevositio/contacto.html"
obj = requests.get(URL, timeout=60, verify=False, allow_redirects=True)
print(obj.encoding)
Output:
ISO-8859-1
Whereas the actual encoding set in the HTML is UTF-8:
content="text/html; charset=UTF-8"
My questions are:
Why does requests.encoding show a different encoding than the one described in the HTML page?
I am trying to convert the content to UTF-8 using objReq.content.decode(encodes).encode("utf-8"). Since the content is already in UTF-8, decoding it as ISO-8859-1 and then encoding it as UTF-8 changes the values, e.g. á changes to Ã.
Is there any way to convert all types of encodings to UTF-8?
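The symptom can be reproduced without any network access. The following is a minimal sketch, assuming Python 3 and a made-up sample string: it shows that UTF-8 bytes decoded as ISO-8859-1 turn each accented character into two wrong characters, while decoding with the codec the bytes were actually written in leaves the text intact.

```python
# Hypothetical sample text; any string with non-ASCII characters behaves the same way.
original = "contáctenos"

# What the server actually sends: the text encoded as UTF-8.
raw = original.encode("utf-8")

# ISO-8859-1 maps every byte to a character, so decoding with it never raises;
# the text is silently garbled instead.
wrong = raw.decode("iso-8859-1")

# Decoding with the codec the bytes were written in recovers the text.
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

As long as the garbled string has not been modified, the damage is reversible: encoding it back to ISO-8859-1 yields the original bytes, which can then be decoded as UTF-8.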
Requests sets the response.encoding attribute to ISO-8859-1 when you have a text/* response and no charset has been specified in the Content-Type response header.
See the Encoding section of the Advanced documentation:
The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.
Bold emphasis mine.
You can test for this by looking for a charset parameter in the Content-Type header:
resp = requests.get(....)
encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
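The same check can be wrapped in a small helper so that it can be exercised without making a request. This is only a sketch: `header_charset` is a hypothetical name, not part of Requests, and it parses the header parameters by hand instead of using the substring test above, which makes it slightly stricter.

```python
def header_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper for illustration; not part of Requests.
    """
    # The first ';'-separated part is the media type; the rest are parameters.
    for part in (content_type or "").split(";")[1:]:
        name, _, value = part.partition("=")
        if name.strip().lower() == "charset":
            # Drop optional quotes around the value and normalise the case.
            return value.strip().strip("\"'").lower() or None
    return None


# No charset in the header: the ISO-8859-1 reported by Requests is only a fallback.
print(header_charset("text/html"))
# Explicit charset: the server really did declare an encoding.
print(header_charset("text/html; charset=UTF-8"))
```

With a Requests response this could be called as `header_charset(resp.headers.get('content-type'))`; a `None` result means the value of `resp.encoding` was not declared by the server.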
Your HTML document specifies the content type in a <meta> header, and it is this header that is authoritative:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
HTML 5 also defines a <meta charset="..." /> tag; see the question "<meta charset="utf-8"> vs <meta http-equiv="Content-Type">".
You should not recode HTML pages to UTF-8 if they contain such a header with a different codec. You must at the very least correct that header in that case.
Using BeautifulSoup:
from bs4 import BeautifulSoup

# pass in explicit encoding if set as a header
encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
content = resp.content
soup = BeautifulSoup(content, from_encoding=encoding)
if soup.original_encoding != 'utf-8':
    meta = soup.select_one('meta[charset], meta[http-equiv="Content-Type"]')
    if meta:
        # replace the meta charset info before re-encoding
        if 'charset' in meta.attrs:
            meta['charset'] = 'utf-8'
        else:
            meta['content'] = 'text/html; charset=utf-8'
    # re-encode to UTF-8; prettify() returns text unless an encoding is passed
    content = soup.prettify(encoding='utf-8')
Similarly, other document standards may also specify specific encodings; XML, for example, is always UTF-8 unless specified otherwise by a <?xml encoding="..." ... ?> XML declaration, which is again part of the document.
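For XML, the declared encoding can be read from the first bytes of the document. The sketch below is an illustration only: `xml_declared_encoding` is a hypothetical helper, it assumes an ASCII-compatible encoding (so that the declaration itself is readable), and it ignores byte order marks and UTF-16 input, which a real XML parser handles.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_XML_DECL = re.compile(
    rb"""^<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def xml_declared_encoding(data, default="utf-8"):
    """Return the encoding named in the XML declaration, else the default.

    Hypothetical helper for illustration; `data` must be bytes.
    """
    match = _XML_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return default


print(xml_declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><root/>'))
print(xml_declared_encoding(b'<?xml version="1.0"?><root/>'))
```

In practice, passing the raw bytes to an XML parser is preferable to sniffing the declaration by hand, since the parser applies the full detection rules.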