python-2.7 web-scraping scraperwiki

Scraperwiki - python - skipping a table row


I'm trying to scrape a table that uses TH as a leading column element with a following TD tag. The problem is that the table uses intermittent dividers that need to be skipped because they don't contain a TH tag.

This is a sample from the table:

<tr><th scope="row">Availability (non-CRS):</th><td></td></tr>
<tr><td colspan="2" class="fieldDivider"><div>&nbsp;</div></td></tr>
<tr><th scope="row">Start Date:</th><td>01 Jun 2012</td></tr>
<tr><th scope="row">Expiry Date:</th><td>31 May 2015</td></tr>
<tr><th scope="row">Duration:</th><td>36 months</td></tr>
<tr><td colspan="2" class="fieldDivider"><div>&nbsp;</div></td></tr>
<tr><th scope="row">Total Value:</th><td>&pound;18,720,000<i>(estimated)</i></td></tr>

I'm using python in scraperwiki to gather the data, but I'm having a problem skipping the offending row.

Without any conditional, my code stops as soon as it reaches a row without a TH tag. I'm currently using an if statement so that the scraping only runs on rows that don't contain a non-breaking space, but my variable (data) isn't getting defined, so the if statement isn't executing properly.

This is my first bit of coding outside of a tutorial, so I expect the answer is very simple, I'm just not sure what it is.

#!/usr/bin/env python

import scraperwiki
import requests
from bs4 import BeautifulSoup

base_url = 'http://www.londoncontractsregister.co.uk/public_crs/contracts/contract-048024/'

html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")

table = soup.findAll('table')
rows = table[0].findAll('tr')


for row in rows:
    th_cell = row.findAll('th')
    td_cell = row.findAll('td')
    if td_cell[0].get_text() == '&nbsp;':
        data = {
           'description' : th_cell[0].get_text(),
           'record' : td_cell[0].get_text()
        }

print data

Solution

  • This is a bit of a quick hack (there are probably nicer ways of doing it), but it's based on your code and seems to get what I think you want: try to get the data from each row, and handle the exception when a row has no TH cell:

    data = []
    
    for row in rows:
        th_cell = row.findAll('th')
        td_cell = row.findAll('td')
        try:
            data.append({'description': th_cell[0].get_text(),
                         'record' : td_cell[0].get_text()})
        except IndexError:
            pass
    
    for item in data:
        print item
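One of the "nicer ways" might be to skip the divider rows explicitly instead of catching the IndexError. The sketch below assumes the divider rows are exactly the ones with no TH cell, as in the sample from the question. The sample HTML is trimmed from the question, and extract_records is a made-up helper name, not part of scraperwiki or BeautifulSoup. It also sidesteps the original comparison against '&nbsp;', which is unlikely to ever match because BeautifulSoup decodes that entity to the Unicode character u'\xa0' before get_text() returns it.

```python
from bs4 import BeautifulSoup

# Sample rows copied from the question, wrapped in a <table> for parsing.
SAMPLE_HTML = u"""
<table>
<tr><th scope="row">Start Date:</th><td>01 Jun 2012</td></tr>
<tr><td colspan="2" class="fieldDivider"><div>&nbsp;</div></td></tr>
<tr><th scope="row">Duration:</th><td>36 months</td></tr>
</table>
"""


def extract_records(rows):
    """Return one dict per row that has both a <th> and a <td> cell.

    Hypothetical helper for illustration only.
    """
    records = []
    for row in rows:
        th_cell = row.find('th')
        td_cell = row.find('td')
        # Divider rows have no <th>, so find() returns None for them.
        if th_cell is None or td_cell is None:
            continue
        records.append({
            'description': th_cell.get_text(strip=True),
            'record': td_cell.get_text(strip=True),
        })
    return records


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
records = extract_records(soup.find('table').find_all('tr'))

for item in records:
    print(item)
```

In the original script this would replace the loop: build rows as before, call the helper on them, and print or save each item. The parenthesised print also works under Python 2.7 when given a single argument.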