python pdf python-3.x scraperwiki

Scraping a PDF with ScraperWiki raises NameError: name 'data' is not defined


I am trying to scrape this PDF with ScraperWiki. The current code fails with NameError: name 'data' is not defined, and the error is raised on this line:

elif int(el.attrib['left']) < 647: data['Neighborhood'] = el.text

If I comment that line out, I get the same error on my else statement.

Here is my code:

import scraperwiki
import urllib2, lxml.etree
#Pull Mondays
url = 'http://www.city.pittsburgh.pa.us/police/blotter/blotter_monday.pdf'
pdfdata = urllib2.urlopen(url).read()
xmldata = scraperwiki.pdftoxml(pdfdata)
root = lxml.etree.fromstring(xmldata)
# how many pages in PDF
pages = list(root)
print "There are",len(pages),"pages"
# Test Scrape of only Page 1 of 29
for page in pages[0:1]:
    for el in page:
        if el.tag == "text":
            if int(el.attrib['left']) < 11: data = { 'Report Name': el.text }
            elif int(el.attrib['left']) < 317: data['Location of Occurrence'] = el.text
            elif int(el.attrib['left']) < 169: data['Incident Time'] = el.text
            elif int(el.attrib['left']) < 647: data['Neighborhood'] = el.text
            elif int(el.attrib['left']) < 338: data['Description'] = el.text
            else:
                data['Zone'] = el.text
                print data

What am I doing wrong?

Also, any suggestions for a better solution would be appreciated.


Solution

  • Unless you've skipped some of your code, your data dictionary only gets created if the condition in this line is matched:

    if int(el.attrib['left']) < 11: data = { 'Report Name': el.text }

    All of your other lines that set values in data depend on it already existing, so you'll get the NameError whenever an element reaches one of those branches before any element has matched this first condition.

    The quick fix is to create an empty data dictionary before looping over the elements, so that it always exists when a branch assigns to it, e.g.

    for page in pages[0:1]:
        data = {}
        for el in page:
            if el.tag == "text":

    etc.
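
As a rough, self-contained sketch of the same idea, the following may help. It makes several assumptions: the XML is a made-up sample in the general shape of pdftoxml-style output (not the real blotter PDF), the standard library's xml.etree.ElementTree stands in for lxml so it runs without extra packages, and the names SAMPLE_XML, COLUMNS, column_for and scrape_page are hypothetical. The column bounds are copied from the question and have not been checked against the PDF. They are tested in ascending order, because in an if/elif chain a `< 169` test placed after `< 317` can never match. The dictionary is created before the element loop and a record is treated as complete when a Zone value is seen, which is a guess at the intended behaviour.

```python
import xml.etree.ElementTree as ET

# Made-up sample data (assumption): real pdftoxml output for the blotter
# will have different positions, attributes and text.
SAMPLE_XML = """<pdf2xml>
<page number="1">
<text top="100" left="10" width="50" height="12">Incident Blotter</text>
<text top="120" left="150" width="50" height="12">01:30</text>
<text top="120" left="300" width="50" height="12">Main St</text>
<text top="120" left="330" width="50" height="12">Theft</text>
<text top="120" left="600" width="50" height="12">Downtown</text>
<text top="120" left="700" width="50" height="12">Zone 2</text>
</page>
</pdf2xml>"""

# Upper bounds taken from the question, sorted ascending so that every
# branch is reachable. Whether they match the PDF layout is unverified.
COLUMNS = [
    (11, 'Report Name'),
    (169, 'Incident Time'),
    (317, 'Location of Occurrence'),
    (338, 'Description'),
    (647, 'Neighborhood'),
]


def column_for(left):
    """Return the column name for a given 'left' position."""
    for upper, name in COLUMNS:
        if left < upper:
            return name
    return 'Zone'  # anything at or beyond the last bound


def scrape_page(page):
    """Collect one dict per record from the <text> elements of a page."""
    records = []
    data = {}  # created up front, so no branch can hit a NameError
    for el in page:
        if el.tag != 'text':
            continue
        name = column_for(int(el.attrib['left']))
        data[name] = el.text
        if name == 'Zone':
            # Assumption: Zone is the last field of a record.
            records.append(data)
            data = {}
    return records


root = ET.fromstring(SAMPLE_XML)
records = scrape_page(list(root)[0])
print(records)
```

Keeping the bounds in a list rather than a chain of elif statements makes the ordering explicit and easier to adjust once the real column positions are known.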