I have an HTML page that declares its encoding with:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1251">
Now, when I try to parse this file with BeautifulSoup, it always returns a None object. I can work around it by decoding explicitly:
import codecs
from bs4 import BeautifulSoup

page = codecs.open('file_name', 'r', 'cp1251')
soup = BeautifulSoup(page.read())
That works fine. But my collection contains pages in both UTF-8 and windows-1251. So I want to know how to determine the charset of a given HTML page, and decode it accordingly if it is windows-1251.
I found this:
soup.original_encoding
But to use that I first need to load the page into soup, and that is exactly where it returns a NoneType object. Any help would be highly appreciated.
I am using Python 2.7
EDIT:
Here is an example of what I mean:
This is my code:
from bs4 import BeautifulSoup
import urllib2
page = urllib2.urlopen(Page_link)
soup = BeautifulSoup(page.read())
print soup.html.head.title
A page containing
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
correctly displays the title of the page.
Now if a page has
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1251">
then the output is
AttributeError: 'NoneType' object has no attribute 'head'
Now I can fix this using the codecs module as shown above. What I am trying to find out is how to determine the encoding so that I can apply it.
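One way to determine the encoding is to sniff the charset declaration out of the raw bytes before handing anything to BeautifulSoup. A minimal sketch, assuming the charset appears in a meta tag (the helper name and the regex are my own illustration, not part of any library):

```python
import re

def sniff_meta_charset(raw_bytes, default='utf-8'):
    # Look for charset=... inside a <meta> tag; this regex is a rough
    # heuristic, not a full HTML parser.
    match = re.search(br'<meta[^>]+charset=["\']?([\w-]+)',
                      raw_bytes, re.IGNORECASE)
    if match:
        return match.group(1).decode('ascii').lower()
    return default  # assume UTF-8 when nothing is declared

raw = b'<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1251">'
print(sniff_meta_charset(raw))  # windows-1251
```

The detected name can then be passed straight to codecs.open or to BeautifulSoup's from_encoding argument.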
These are the two sites that I am trying to crawl to gather certain information:
You are loading your pages from the web; look for a Content-Type header with a charset parameter to see if the web server already told you about the encoding:
charset = page.headers.getparam('charset')
soup = BeautifulSoup(page.read(), from_encoding=charset)
If no such parameter is present, charset is set to None and BeautifulSoup falls back to guessing.
You can also try out different parsers; if the HTML is malformed, different parsers will repair the HTML in different ways, perhaps allowing BeautifulSoup to detect the encoding better.