I am trying to scrape this PDF with ScraperWiki. The current code fails with the error name 'data' is not defined, and the error is raised on this line:
    elif int(el.attrib['left']) < 647: data['Neighborhood'] = el.text
If I comment that line out, I get the same error on my else statement.
Here is my code:
    import scraperwiki
    import urllib2, lxml.etree

    # Pull Mondays
    url = 'http://www.city.pittsburgh.pa.us/police/blotter/blotter_monday.pdf'
    pdfdata = urllib2.urlopen(url).read()
    xmldata = scraperwiki.pdftoxml(pdfdata)
    root = lxml.etree.fromstring(xmldata)

    # how many pages in PDF
    pages = list(root)
    print "There are",len(pages),"pages"

    # Test Scrape of only Page 1 of 29
    for page in pages[0:1]:
        for el in page:
            if el.tag == "text":
                if int(el.attrib['left']) < 11: data = { 'Report Name': el.text }
                elif int(el.attrib['left']) < 317: data['Location of Occurrence'] = el.text
                elif int(el.attrib['left']) < 169: data['Incident Time'] = el.text
                elif int(el.attrib['left']) < 647: data['Neighborhood'] = el.text
                elif int(el.attrib['left']) < 338: data['Description'] = el.text
                else:
                    data['Zone'] = el.text
                    print data
What am I doing wrong?
Also, any suggestions for a better solution would be appreciated.
Unless you've skipped some of your code, your data dictionary only gets created if the condition in this line is matched:
    if int(el.attrib['left']) < 11: data = { 'Report Name': el.text }
All of your other lines where you set values in data depend on it already existing, so you'll get the NameError if this first condition isn't matched.
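To see the failure in isolation, here is a minimal sketch with made-up left offsets (they are assumptions, not values read from the PDF). The first offset does not satisfy the first condition, so the code tries to store into data before data has been created:

```python
# Made-up 'left' offsets for illustration; the first one is not < 11.
lefts = [200, 5]

message = None
try:
    for left in lefts:
        if left < 11:
            data = {'Report Name': 'example'}      # the only place data is created
        else:
            data['Neighborhood'] = 'example'       # fails if data does not exist yet
except NameError as exc:
    message = str(exc)

print(message)
```

If the offsets are reordered so that 5 comes first, the loop runs without an error, which is why the problem depends on the order of elements in the page.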
The quick fix would be to always create an empty data dictionary, e.g.
    for page in pages[0:1]:
        for el in page:
            data = {}
            if el.tag == "text":
etc.
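One thing to watch with the quick fix: resetting data for every element means each dictionary ends up with a single field. Also, in the question's chain the < 317 test comes before < 169, and < 647 before < 338, so the 'Incident Time' and 'Description' branches can never run. Below is a rough sketch of one way around both issues. It assumes the thresholds from the question are right-hand column boundaries that should be checked in ascending order, and that a 'Report Name' element starts a new row; neither is confirmed by the PDF. The names COLUMNS, column_for and rows_from_page are hypothetical, the sample XML is made up, and the standard-library ElementTree stands in for lxml.etree (both expose .tag, .attrib and .text).

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

# Upper bounds on 'left', copied from the question and sorted ascending.
# Assumption: each bound marks the right edge of one column.
COLUMNS = [
    (11, 'Report Name'),
    (169, 'Incident Time'),
    (317, 'Location of Occurrence'),
    (338, 'Description'),
    (647, 'Neighborhood'),
]
LAST_COLUMN = 'Zone'


def column_for(left):
    """Return the column name for a 'left' offset (hypothetical helper)."""
    for bound, name in COLUMNS:
        if left < bound:
            return name
    return LAST_COLUMN


def rows_from_page(page):
    """Group the <text> elements of one page into row dictionaries."""
    rows = []
    data = {}  # created before any branch runs, so no NameError
    for el in page:
        if el.tag != 'text':
            continue
        name = column_for(int(el.attrib['left']))
        if name == 'Report Name' and data:
            rows.append(data)  # a new report starts: keep the previous row
            data = {}
        data[name] = el.text
    if data:
        rows.append(data)  # keep the last row on the page
    return rows


# Made-up page for illustration only.
sample = ET.fromstring(
    '<page>'
    '<text left="5">Report A</text>'
    '<text left="200">Main St</text>'
    '<text left="700">Zone 1</text>'
    '<text left="5">Report B</text>'
    '</page>'
)
print(rows_from_page(sample))
```

Keeping the boundaries in one sorted list makes it harder to write an unreachable branch by accident, and the real boundaries can be adjusted in one place once you have checked them against the XML that pdftoxml produces.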