Tags: python, python-2.7, html-parsing, beautifulsoup, cp1251

Parsing different unicode files using BeautifulSoup


I have an HTML page that declares this charset:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1251">

When I try to parse this file with BeautifulSoup, it always returns a None object. I can convert it using:

import codecs

page = codecs.open('file_name', 'r', 'cp1251')
soup = BeautifulSoup(page.read())

Now it works fine. But my collection contains pages in both UTF-8 and windows-1251 charsets, so I want to know how to determine the charset of a particular HTML page and decode it accordingly if it is in windows-1251.

I found this:

soup.originalEncoding

But to use that I first need to load the page into 'soup', and that is exactly where I get the NoneType object. Any help would be highly appreciated.

I am using Python 2.7

EDIT:

Here is an example of what I mean:

This is my code:

from bs4 import BeautifulSoup
import urllib2

page = urllib2.urlopen(Page_link)  # Page_link is the URL of the page to fetch
soup = BeautifulSoup(page.read())

print soup.html.head.title

A page containing

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

correctly displays the title of the page.

Now if a page has

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1251">

then the output is

AttributeError: 'NoneType' object has no attribute 'head'

I can fix this with the codecs approach shown above. What I am trying to find out is how to determine the encoding so that I can apply it.

These are the two sites I am trying to crawl to gather certain information:

http://www.orderapx.com/ and http://www.prpoakland.com/


Solution

  • You are loading your pages from the web; look for a Content-Type header with a charset parameter to see whether the webserver has already told you the encoding:

    # urllib2 exposes the response headers as a mimetools.Message
    charset = page.headers.getparam('charset')
    soup = BeautifulSoup(page.read(), from_encoding=charset)
    

    If no such parameter is present, charset is set to None and BeautifulSoup will fall back to guessing.
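
    If the header carries no charset, you can also fall back to sniffing the `<meta>` tag yourself before handing the bytes to BeautifulSoup. A minimal sketch (the helper name `sniff_meta_charset` is mine, not part of any library):

    ```python
    import re

    def sniff_meta_charset(raw_bytes, default=None):
        # Look for charset=... in the first couple of kilobytes of the raw
        # document; this catches both the old-style http-equiv meta tag and
        # the HTML5 <meta charset="..."> form.
        match = re.search(br'charset=["\']?([\w-]+)', raw_bytes[:2048],
                          re.IGNORECASE)
        if match:
            return match.group(1).decode('ascii')
        return default

    raw = b'<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1251">'
    sniff_meta_charset(raw)  # returns 'windows-1251'
    ```

    The returned name can be passed as `from_encoding`, or used to decode the bytes directly; Python's codec registry accepts 'windows-1251' as an alias for cp1251.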

    You can also try out different parsers; if the HTML is malformed, different parsers will repair it in different ways, which may let BeautifulSoup detect the encoding better.
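
    For instance, a quick way to compare parsers on the same markup (html.parser ships with Python; lxml and html5lib are optional third-party installs, so this sketch skips any parser that is missing):

    ```python
    from bs4 import BeautifulSoup

    broken = '<html><head><title>Test</title><body><p>Unclosed<p>paragraphs'

    for parser_name in ('html.parser', 'lxml', 'html5lib'):
        try:
            soup = BeautifulSoup(broken, parser_name)
        except Exception:  # bs4 raises FeatureNotFound for an absent parser
            continue
        # Each parser repairs the missing </head> and </p> tags differently,
        # but the title should survive in all of them.
        print(parser_name, soup.title.string)
    ```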