Tags: python, beautifulsoup, urllib2, html5lib

BeautifulSoup functionality not working properly in a specific scenario


I am trying to read the following URL using urllib2, http://frcwest.com/, and then search the data for the meta redirect.

It reads the following data in:

   <!--?xml version="1.0" encoding="UTF-8"?--><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
   <html xmlns="http://www.w3.org/1999/xhtml"><head><title></title><meta content="0;url= Home.html" http-equiv="refresh"/></head><body></body></html>

Reading it into BeautifulSoup works fine. However, for some reason none of the functionality works in this specific scenario, and I don't understand why. BeautifulSoup has worked great for me in all other scenarios. Yet even a simple call such as:

    soup.findAll('meta')

produces no results.

My eventual goal is to run:

    soup.find("meta",attrs={"http-equiv":"refresh"})

But if:

    soup.findAll('meta')

isn't even working, then I'm stuck. Any insight into this mystery would be appreciated, thanks!


Solution

  • It's the comment and doctype that throw off the parser here, and subsequently BeautifulSoup.

    Even the HTML tag seems 'gone':

    >>> soup.find('html') is None
    True
    

    Yet it is still there in the .contents iterable. You can find things again by rebinding soup to the top-level html element:

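    # Walk the top-level nodes; anything without a tag name of 'html' is skipped
    for elem in soup:
        if getattr(elem, 'name', None) == u'html':
            soup = elem  # rebind soup to the <html> element itself
            break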
    
    soup.find_all('meta')
    

    Demo:

    >>> for elem in soup:
    ...     if getattr(elem, 'name', None) == u'html':
    ...         soup = elem
    ...         break
    ... 
    >>> soup.find_all('meta')
    [<meta content="0;url= Home.html" http-equiv="refresh"/>]
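Once the tag is found, its content attribute ("0;url= Home.html" above) still has to be split into a delay and a target to follow the redirect. The following is only a minimal sketch, assuming Python 3 and the standard library; parse_meta_refresh is a hypothetical helper name, not part of BeautifulSoup, and it handles just the simple "delay;url=target" shape shown in the question.

```python
from urllib.parse import urljoin


def parse_meta_refresh(content, base_url):
    """Split a meta refresh content value into (delay_seconds, absolute_url).

    Hypothetical helper: assumes the "delay;url=target" form. The target is
    None when no url= part is present.
    """
    delay_part, _, rest = content.partition(";")
    delay = float(delay_part.strip() or 0)

    target = None
    key, sep, value = rest.partition("=")
    if sep and key.strip().lower() == "url":
        # Drop surrounding whitespace and optional quotes, then resolve
        # the (possibly relative) target against the page URL.
        target = urljoin(base_url, value.strip().strip("'\""))
    return delay, target


delay, target = parse_meta_refresh("0;url= Home.html", "http://frcwest.com/")
print(delay, target)
```

With the tag from the demo, the content value would come from something like meta['content'], and the base URL is the address that was fetched. A non-numeric delay raises ValueError here, which may or may not be what you want for untrusted pages.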