I'm trying to scrape a table that uses TH as a leading column element with a following TD tag. The problem is that the table uses intermittent dividers that need to be skipped because they don't contain a TH tag.
This is a sample from the table:
<tr><th scope="row">Availability (non-CRS):</th><td></td></tr>
<tr><td colspan="2" class="fieldDivider"><div> </div></td></tr>
<tr><th scope="row">Start Date:</th><td>01 Jun 2012</td></tr>
<tr><th scope="row">Expiry Date:</th><td>31 May 2015</td></tr>
<tr><th scope="row">Duration:</th><td>36 months</td></tr>
<tr><td colspan="2" class="fieldDivider"><div> </div></td></tr>
<tr><th scope="row">Total Value:</th><td>£18,720,000<i>(estimated)</i></td></tr>
I'm using Python in ScraperWiki to gather the data, but I'm having a problem skipping the offending rows.
Without any conditional, my code stops as soon as it reaches a row without a TH tag. I'm currently using an if statement to try to run the scraping only on rows that aren't just a non-breaking space, but my variable (data) isn't getting defined, so the if statement isn't working properly.
This is my first bit of coding outside of a tutorial, so I expect the answer is very simple; I'm just not sure what it is.
#!/usr/bin/env python
import scraperwiki
import requests
from bs4 import BeautifulSoup

base_url = 'http://www.londoncontractsregister.co.uk/public_crs/contracts/contract-048024/'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
table = soup.findAll('table')
rows = table[0].findAll('tr')
for row in rows:
    th_cell = row.findAll('th')
    td_cell = row.findAll('td')
    if td_cell[0].get_text() == ' ':
        data = {
            'description' : th_cell[0].get_text(),
            'record' : td_cell[0].get_text()
        }
    print data
Bit of a quick hack (there are probably nicer ways of doing this), but this is based on your code and seems to get what I think you want. Try to get the data from each row, and handle the exception when we can't:
data = []
for row in rows:
    th_cell = row.findAll('th')
    td_cell = row.findAll('td')
    try:
        data.append({'description': th_cell[0].get_text(),
                     'record': td_cell[0].get_text()})
    except IndexError:
        # Divider rows have no TH cell, so th_cell[0] raises IndexError; skip them.
        pass
for item in data:
    print(item)
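If you'd rather test for the TH cell directly than catch the exception, something along these lines might also work. It's only a sketch: extract_rows and SAMPLE_HTML are names I've made up for illustration, and the sample is a trimmed copy of the rows in your question, with the divider's content assumed to be a non-breaking space. row.find() returns None when there is no matching tag, so the divider rows can be skipped with a plain check.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the rows from the question, including one divider row.
# The &nbsp; inside the divider is an assumption about the real page.
SAMPLE_HTML = """
<table>
<tr><th scope="row">Start Date:</th><td>01 Jun 2012</td></tr>
<tr><td colspan="2" class="fieldDivider"><div>&nbsp;</div></td></tr>
<tr><th scope="row">Duration:</th><td>36 months</td></tr>
</table>
"""


def extract_rows(html):
    """Return one dict per row that has both a TH and a TD cell."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find_all('tr'):
        th_cell = row.find('th')  # None when the row has no TH
        td_cell = row.find('td')
        if th_cell is None or td_cell is None:
            continue  # divider row: nothing to record
        records.append({
            'description': th_cell.get_text(strip=True),
            'record': td_cell.get_text(strip=True),
        })
    return records


records = extract_rows(SAMPLE_HTML)
for item in records:
    print(item)
```

With the real page you would pass html.content from your requests call instead of SAMPLE_HTML. The explicit check makes it obvious which rows are being skipped, whereas the try/except version would also hide any other IndexError.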