python-2.7 | pdf | scraperwiki

Using scraperwiki for a PDF file on disk


I am trying to get some data out of a PDF document using scraperwiki for Python. It works beautifully if I download the file using urllib2, like so:

pdfdata = urllib2.urlopen(url).read()
xmldata = scraperwiki.pdftoxml(pdfdata)
root = lxml.html.fromstring(xmldata)
pages = list(root)

But here comes the tricky part. Since I would like to do this for a large number of PDF files that I have on disk, I would like to do away with the first line and pass the PDF file in directly. However, if I try

pdfdata = open("filename.pdf","wb")
xmldata = scraperwiki.pdftoxml(pdfdata)
root = lxml.html.fromstring(xmldata)

I get the following error:

xmldata = scraperwiki.pdftoxml(pdfdata)
File "/usr/local/lib/python2.7/dist-packages/scraperwiki/utils.py", line 44, in pdftoxml
pdffout.write(pdfdata)
TypeError: must be string or buffer, not file

I am guessing that this occurs because I am not opening the PDF correctly?

If so, is there a way to open a PDF from disk just like urllib2.urlopen() does for a URL?


Solution

  • urllib2.urlopen(...).read() does just that: it reads the contents of the stream returned from the URL you passed as a parameter.

    open(), on the other hand, returns a file object. Just as urllib2 needed an open() call followed by a read() call, so does a file object. (Note also that your snippet opens the file in write mode, "wb"; you need read mode, "rb".)

    Change your program to use the following lines:

    with open("filename.pdf", "rb") as pdffile:
        pdfdata = pdffile.read()

    xmldata = scraperwiki.pdftoxml(pdfdata)
    root = lxml.html.fromstring(xmldata)


    This will open your PDF and read its contents into a string named pdfdata. From there, your call to scraperwiki.pdftoxml() will work as expected.
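
    Since you mention wanting to do this for a large number of PDF files, a minimal sketch of the batch case might look like the following (the pdfs/ directory and the glob pattern are assumptions; adapt them to wherever your files live):

    import glob

    import lxml.html
    import scraperwiki

    # Hypothetical directory holding the PDF files; adjust to your setup.
    for filename in glob.glob("pdfs/*.pdf"):
        with open(filename, "rb") as pdffile:
            pdfdata = pdffile.read()

        xmldata = scraperwiki.pdftoxml(pdfdata)
        root = lxml.html.fromstring(xmldata)
        pages = list(root)
        print "%s: %d pages" % (filename, len(pages))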