I am currently using pdftotext to read PDF files into Python with the following code:
import pdftotext

bill_full = []
with open('sample.pdf', "rb") as f:
    pdf = pdftotext.PDF(f)
    bill = ''
    # Concatenate the text of every page into one string
    for page in pdf:
        bill = bill + page
    bill_full.append(bill)
This code mostly works across my complete dataset, but I encounter seemingly random errors. Applied to the following PDF, https://legiscan.com/WI/text/AB649/id/456434/Wisconsin-2009-AB649-Introduced.pdf, it results in:
2011 − 2012 LEGISLATURE LRB−1478/1 2011 SENATE BILL 27\r\n\r\n\r\n\r\n\r\n March 1, 2011 − Introduced by JOINT COMMITTEE ON FINANCE. Referred to Joint\r\n Committee on Finance.\r\n\r\n\r\n\r\n\r\n1 AN ACT relating to: state finances and appropriations, constituting the\r\n\r\n2 executive budget act of the 2011 legislature.\r\n\r\n\r\n Analysis by the Legislative Reference Bureau\r\n INTRODUCTION\r\n
However, when applied to others (e.g. https://legiscan.com/WI/text/AB408/id/423828/Wisconsin-2009-AB408-Introduced.pdf), I get the following sequence of characters:
\x08\x08\x11 \x06 \x08 \x08 \x1c\x18\x1a\x1b"\x1c\x14#$!\x18
What is different in these two PDFs? Ideally I would like to detect "unreadable" PDFs and drop them from my analysis.
To answer the direct question: what is different is the CID data, so let's just look at one object on page 1 of each file. Here I pick the subject of your question, the first text, which includes the digits 1 2 9 0, the letters L E G I S A T U R, and the other characters in the title.
Here we see that, good or bad, they are all stored as the same font type, ??????+PSOwstnewcspsb. The name is unclear to me, but it seems to follow the pattern PSO WeSTern NEW Courier ??? Bold.
So why would some be mapped correctly (by, say, OCR) and some not? I don't know, and there is often no clear rhyme or reason. We can, however, see a difference in the font definitions: the good one starts at the printable space character (/FirstChar 32/LastChar 116), whilst both of the non-working ones start at zero (/FirstChar 0/LastChar ##, approximately 66), i.e. they include a non-standard printing range. That by itself is not an indicator of a bad font, although in other bad examples I have seen /FirstChar 2 give a hint of a poorly defined font. The problem with searching for /FirstChar is that it may be encrypted or encoded, so in many PDFs it cannot be found until the file is disassembled.
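As a rough illustration of that /FirstChar hint, here is a minimal sketch that scans the raw bytes of a PDF for /FirstChar entries. The function names and the cut-off of 32 are my own assumptions, and, as said above, this only sees entries that are stored unencrypted and uncompressed, so finding nothing proves nothing.

```python
import re
from typing import List

# Matches "/FirstChar 32", "/FirstChar32", etc. in raw PDF bytes.
# Entries inside compressed or encrypted objects will not be seen.
FIRST_CHAR_RE = re.compile(rb"/FirstChar\s*(\d+)")


def first_char_values(raw_pdf: bytes) -> List[int]:
    """Return every /FirstChar value visible in the raw bytes (hypothetical helper)."""
    return [int(value) for value in FIRST_CHAR_RE.findall(raw_pdf)]


def has_suspicious_first_char(raw_pdf: bytes, printable_start: int = 32) -> bool:
    """True if any visible font starts below the printable range (a hint, not proof)."""
    return any(value < printable_start for value in first_char_values(raw_pdf))
```

To try it, read a file with open('sample.pdf', 'rb') and pass f.read() to has_suspicious_first_char. Treat a True result as a reason to look closer, not as a verdict.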
The only good indication of bad characters is that an otherwise good plain-text extraction contains invalid (non-printable) characters.
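That check can be sketched in a few lines of Python: count the control characters in the extracted text and flag the text when they make up too large a share. The function names, the set of tolerated whitespace characters and the 5% threshold are assumptions of mine, to be tuned against your own data.

```python
# Whitespace control characters that legitimately appear in extracted text.
ALLOWED_CONTROLS = {"\n", "\r", "\t", "\f"}


def control_char_ratio(text: str) -> float:
    """Fraction of characters that are control codes (below 32, or DEL),
    not counting ordinary whitespace such as newlines and tabs."""
    if not text:
        return 0.0
    bad = sum(
        1
        for ch in text
        if (ord(ch) < 32 or ord(ch) == 127) and ch not in ALLOWED_CONTROLS
    )
    return bad / len(text)


def looks_unreadable(text: str, threshold: float = 0.05) -> bool:
    """True if the share of control characters exceeds the (assumed) threshold."""
    return control_char_ratio(text) > threshold
```

With the code from the question, you would only append to bill_full when looks_unreadable(bill) is False.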
You say you wish to avoid files with a bad construct, but many files may have only some bad pages or parts of pages. For a wider example of this issue, see "How to identify likely broken pdf pages before extracting its text?"
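Because the damage can be limited to some pages, it may be more useful to test page by page than to drop whole files. A minimal sketch follows, assuming the pages arrive as an iterable of strings (which is how the pdftotext.PDF object is iterated in the question); bad_page_indices and its 5% threshold are hypothetical choices of mine.

```python
import re
from typing import Iterable, List

# Control characters other than tab, newline, form feed and carriage return, plus DEL.
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0e-\x1f\x7f]")


def bad_page_indices(pages: Iterable[str], threshold: float = 0.05) -> List[int]:
    """Return the zero-based indices of pages whose text is mostly control codes."""
    bad = []
    for index, page in enumerate(pages):
        # Empty pages are skipped: they carry no evidence either way.
        if page and len(CONTROL_RE.findall(page)) / len(page) > threshold:
            bad.append(index)
    return bad
```

Calling bad_page_indices(pdf) on the pdf object from the question gives the pages to skip, so you can keep the readable pages of a partly broken file or drop the file only when every page is flagged.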