I have a script that goes through and parses a large collection of PDFs. I noticed that when I try to parse one particular PDF, the script just stalls forever: it doesn't throw an error, and as far as I can tell the PDF is not corrupted. I can't tell what the issue is, but I can see that it happens on page 4. Is there a way to find out what is causing this, or to just skip the PDF if it takes longer than one minute to parse?
For reference, here is the PDF: https://go.boarddocs.com/fl/palmbeach/Board.nsf/files/CTWGW9459021/$file/22C-001R_2ND%20RENEWAL%20CONTRACT_TERRACON.pdf
from PyPDF2 import PdfReader

doc = "somefile.pdf"
doc_text = ""
try:
    print(doc)
    reader = PdfReader(doc)
    for i in range(len(reader.pages)):
        print(i)
        page = reader.pages[i]
        text = page.extract_text()
        doc_text += text
except Exception as e:
    print(f"The file failed due to error {e}:")
    doc_text = ""
You should no longer use PyPDF2 unless you really have to; switch to pypdf instead. PyPDF2 is deprecated, see the note on PyPI: https://pypi.org/project/PyPDF2/
Running the migrated equivalent of your code with the latest pypdf release shows no performance issues:
from pypdf import PdfReader

doc = "78867160.pdf"
doc_text = ""
try:
    print(doc)
    reader = PdfReader(doc)
    for i, page in enumerate(reader.pages):
        print(i)
        text = page.extract_text()
        doc_text += text
except Exception as e:
    print(f"The file failed due to error {e}:")
    doc_text = ""