I am trying to read images using PyPDF2 and convert them to text. I have around 3000 PDF invoices, and I want to get the product description, quantity sold, and price from each one. The problem is that PyPDF2 reads the strings but returns them in a weird order. I want it to read the images and extract the text from top to bottom and left to right. Is there any way I can do that? Here is what I am doing so far:
from pprint import pprint

from PyPDF2 import PdfReader

reader = PdfReader(invoice)  # invoice: path to one of the PDF files
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()
pprint(text)
I have also tried Pytesseract, and it gave the same results. The weird thing is that for some files the output is fine and in the order I want. Is there any way I can change the settings to make it read from top to bottom first and then left to right?
This is the part of the invoice that has the description, quantity, and unit price. I want each record to be read in sequential order. Thanks.
PyPDF2 is deprecated. Use pypdf. I'm the maintainer of both projects.
I am trying to read the images using PyPDF2
pypdf is not OCR software, so it cannot extract text from images. Tesseract is OCR software; see the description of pypdf's scope in its documentation.
The problem is PyPDF2 is reading the string but spitting it out in weird order.
pypdf returns text in the order in which it appears in the file. That is not necessarily the order in which it appears visually on the page.
The reason is that PDF positions every token absolutely, so the order in the file does not have to match the layout.
Imagine you have the sentence "This is a Hello World". Within the document you might see:
("This", x=0, y=0)
("Hello", x=30, y=0)
("World", x=40, y=0)
("is", x=10, y=0)
("a", x=20, y=0)
Here pypdf would extract "This Hello World is a".
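To make the example above concrete, here is a minimal sketch in plain Python. It does not use pypdf at all: the `Token` type and both helper functions are hypothetical names introduced for illustration. It shows that joining the tokens in file order reproduces the scrambled sentence, while sorting by position recovers the visual order. The sort assumes that a smaller `y` means higher on the page; PDF coordinates usually grow upward from the bottom-left corner, so real code would likely need to flip the sign of `y` and tolerate small differences in `y` within one line.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """A piece of text with an absolute position (hypothetical type)."""
    text: str
    x: float
    y: float


# Tokens in the order they are stored in the file, as in the example above.
tokens_in_file_order = [
    Token("This", 0, 0),
    Token("Hello", 30, 0),
    Token("World", 40, 0),
    Token("is", 10, 0),
    Token("a", 20, 0),
]


def join_in_file_order(tokens):
    """Join tokens exactly as they are stored, ignoring positions."""
    return " ".join(t.text for t in tokens)


def join_in_visual_order(tokens):
    """Join tokens top to bottom, then left to right.

    Assumption: smaller y means higher on the page. For typical PDF
    coordinates (y grows upward) the key would be (-t.y, t.x) instead.
    """
    ordered = sorted(tokens, key=lambda t: (t.y, t.x))
    return " ".join(t.text for t in ordered)


print(join_in_file_order(tokens_in_file_order))
print(join_in_visual_order(tokens_in_file_order))
```

The first call prints the tokens in stored order and the second in sorted order. Whether such a re-sort helps with a real invoice depends on getting reliable positions for each piece of text, which is a separate problem.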
Typically this doesn't happen, though: most PDF generators write the text in a reasonable order. It depends on how the PDF was generated.
You might want to give https://pypi.org/project/pypdfium2/ a try:
python3 -m pip install -U pypdfium2
Code:
import pypdfium2 as pdfium


def pdfium_get_text(data: bytes) -> str:
    text = ""
    pdf = pdfium.PdfDocument(data)
    for i in range(len(pdf)):
        page = pdf.get_page(i)
        textpage = page.get_textpage()
        text += textpage.get_text_range() + "\n"
    return text


with open("example.pdf", "rb") as f:
    data = f.read()
print(pdfium_get_text(data))