python, web-scraping, beautifulsoup, windmill

Windmill not getting all HTML content


I'm trying to scrape data from a web page using the Python Windmill framework, but I'm having trouble getting the content of an HTML table. The table is generated by JavaScript, which is why I'm using Windmill to grab the page. However, the content Windmill returns doesn't include the table, which causes errors when I try to parse it with BeautifulSoup.

from windmill.authoring import WindmillTestClient
from BeautifulSoup import BeautifulSoup

from copy import copy
import re

def get_massage():
    my_massage = copy(BeautifulSoup.MARKUP_MASSAGE)
    my_massage.append((re.compile(u"document.write(.+);"), lambda match: ""))
    my_massage.append((re.compile(u'alt=".+">'), lambda match: ">"))
    return my_massage

def test_scrape():
    my_massage = get_massage()
    client = WindmillTestClient(__name__)
    client.open(url='http://marinetraffic.com/ais/datasheet.aspx?MMSI=636092060&TIMESTAMP=2&menuid=&datasource=POS&app=&mode=&B1=Search')
    client.waits.forPageLoad(timeout='60000')
    html = client.commands.getPageText()
    assert html['status']
    assert html['result']
    soup=BeautifulSoup(html['result'],markupMassage=my_massage)
    print soup.prettify()

In the output from the soup the table is missing, yet it is displayed when you inspect the page with something like Firebug. Overall, I'm trying to grab the table content and parse it into some kind of data structure for further processing. Any help is much appreciated!


Solution

  • The problem is that the markup massage you're using doesn't suit this page: it removes more HTML than it should.

    To check whether BeautifulSoup can parse the page without any massage, I tried this:

    soup = BeautifulSoup(html['result'])
    soup.table
    

    and it worked fine, so it seems that no markup massage is needed in this case after all.
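
One plausible reason the massage strips too much is that `.+` is greedy: in a pattern like `alt=".+">` it runs to the last `">` on the line rather than the first. The sketch below illustrates this with the standard `re` module only. The `sample` string is a made-up fragment, not taken from the actual page, and the `bounded` pattern is just one possible tighter alternative, not something from the original code.

```python
import re

# Hypothetical one-line HTML fragment, invented for illustration only.
sample = '<td><img src="a.png" alt="flag"></td><td class="name">Vessel</td>'

# Pattern from the question: the greedy ".+" extends to the LAST '">' on the line.
greedy = re.compile(r'alt=".+">')

# Assumed tighter variant: [^"]* cannot run past the attribute's closing quote.
bounded = re.compile(r'alt="[^"]*">')

greedy_result = greedy.sub(">", sample)
bounded_result = bounded.sub(">", sample)

print(greedy_result)   # the second <td> opening tag is swallowed as well
print(bounded_result)  # only the alt attribute is removed
```

If a massage really is needed for some other page, bounding the match like this (or using the non-greedy `.+?`) should keep the surrounding markup, including any table cells, intact.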