
PdfReader from PyPDF2 is not reading from top to bottom and left to right in the sequence


I am trying to read images using PyPDF2 and convert them to text. I have around 3000 PDF invoices and I want to get the product description, quantity sold, and price from each. The problem is that PyPDF2 reads the strings but outputs them in a strange order. I want it to read the images and extract the text from top to bottom and left to right. Is there any way I can do that? Here is what I am doing so far:

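from pprint import pprint

from PyPDF2 import PdfReader

# `invoice` is the path (or file object) of one invoice PDF
reader = PdfReader(invoice)
number_of_pages = len(reader.pages)
page = reader.pages[0]

# Extract the text of the first page
text = page.extract_text()
pprint(text)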

I have also tried Pytesseract and it gave the same results. The weird thing is that for some files the output is fine and in the order I want. Is there any way I can change the settings to make it read from top to bottom first and then left to right?

This is the part of the invoice that has the description, qty and unit price. I want it to read each record in sequential order. Thanks

[Image: part of the invoice showing the description, qty and unit price]


Solution

  • PyPDF2 is deprecated. Use pypdf. I'm the maintainer of both projects.

    I am trying to read the images using PyPDF2

    pypdf is not OCR software and thus cannot extract text from images. Tesseract would be OCR software; see the scope of pypdf.

    The problem is PyPDF2 is reading the string but spitting it out in weird order.

    pypdf extracts text in the order in which it appears in the file. That is not necessarily the order in which it appears visually.

    The problem is that PDF absolutely positions tokens.

    Imagine you have a sentence "This is a Hello World". Then within the document you might see

    ("This", x=0, y=0)
    ("Hello", x=30, y=0)
    ("World", x=40, y=0)
    ("is", x=10, y=0)
    ("a", x=20, y=0)
    

    Here pypdf would extract "This Hello World is a".
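To make the example above concrete, here is a minimal sketch in plain Python. The `(text, x, y)` tuples and the function name `reading_order` are made up for illustration and are not part of pypdf. It assumes PDF-style coordinates where y grows upward, so "top to bottom" means descending y.

```python
# Hypothetical token layout for illustration: (text, x, y) in file order.
tokens = [
    ("This", 0, 0),
    ("Hello", 30, 0),
    ("World", 40, 0),
    ("is", 10, 0),
    ("a", 20, 0),
]


def reading_order(tokens):
    """Sort tokens top to bottom, then left to right.

    Assumes y grows upward (larger y is higher on the page),
    so rows are ordered by descending y and then ascending x.
    """
    return sorted(tokens, key=lambda t: (-t[2], t[1]))


# Order as stored in the file versus order after sorting by position
file_order = " ".join(text for text, _, _ in tokens)
visual_order = " ".join(text for text, _, _ in reading_order(tokens))

print(file_order)    # This Hello World is a
print(visual_order)  # This is a Hello World
```

An exact sort like this only works when tokens on the same line share exactly the same y value; real documents usually need some tolerance.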

    Typically this doesn't happen, though. Most PDF generators also write the text in a reasonable order within the document, but that depends on how the PDF is generated.
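If the file order is the problem, one possible workaround is to collect the position of each text fragment and rebuild the rows yourself. The sketch below uses the `visitor_text` parameter of pypdf's `extract_text`, which calls the visitor with the text, the current transformation matrix, the text matrix, the font dictionary and the font size. Everything else is an assumption for illustration: the helper names `group_into_rows` and `extract_rows`, the `y_tolerance` value, and the simplification of reading the position from the text matrix alone (`tm[4]`, `tm[5]`) while ignoring the transformation matrix. It will not help with scanned invoices, since those contain no text for pypdf to extract.

```python
from collections import defaultdict


def group_into_rows(fragments, y_tolerance=2.0):
    """Group (text, x, y) fragments into visual rows.

    Fragments whose y values fall into the same bucket of size
    `y_tolerance` are treated as one row. Rows are returned top to
    bottom (descending y), fragments within a row left to right.
    """
    rows = defaultdict(list)
    for text, x, y in fragments:
        if text.strip():  # ignore whitespace-only fragments
            rows[round(y / y_tolerance)].append((x, text.strip()))
    return [
        " ".join(text for _, text in sorted(rows[key]))
        for key in sorted(rows, reverse=True)
    ]


def extract_rows(path, page_index=0):
    """Extract one page with pypdf and reorder its text by position."""
    from pypdf import PdfReader  # lazy import; needs `pip install pypdf`

    fragments = []

    def visitor(text, cm, tm, font_dict, font_size):
        # Assumption: tm[4] and tm[5] approximate the x/y position.
        fragments.append((text, tm[4], tm[5]))

    page = PdfReader(path).pages[page_index]
    page.extract_text(visitor_text=visitor)
    return group_into_rows(fragments)
```

Bucketing by rounded y is crude: two fragments on the same visual line can still land in neighbouring buckets, so the tolerance may need tuning per invoice layout.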

    Alternative: PyPDFium2

    You might want to give https://pypi.org/project/pypdfium2/ a try:

    python3 -m pip install -U pypdfium2
    

    Code:

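    import pypdfium2 as pdfium

    def pdfium_get_text(data: bytes) -> str:
        """Return the text of all pages, separated by newlines."""
        text = ""
        pdf = pdfium.PdfDocument(data)
        # Walk the pages in order and append each page's text
        for i in range(len(pdf)):
            page = pdf.get_page(i)
            textpage = page.get_textpage()
            text += textpage.get_text_range() + "\n"
        return text

    # Read the PDF as bytes and print the extracted text
    with open("example.pdf", "rb") as f:
        data = f.read()
    print(pdfium_get_text(data))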
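
Since the question mentions around 3000 invoices, a function like `pdfium_get_text` can be run over a whole folder. This is only a sketch: `extract_folder` is a hypothetical helper, and it assumes all invoices sit in one directory with a lowercase `.pdf` extension. The extractor is passed in as a parameter, so any function that maps PDF bytes to a string will work.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_folder(folder: str, extractor: Callable[[bytes], str]) -> Dict[str, str]:
    """Run `extractor` on every .pdf file in `folder` (hypothetical helper).

    Returns a dict mapping file name to extracted text. Files that raise
    an exception are reported and skipped, so one broken invoice does not
    stop the whole batch.
    """
    results = {}
    for path in sorted(Path(folder).glob("*.pdf")):
        try:
            results[path.name] = extractor(path.read_bytes())
        except Exception as exc:  # broad on purpose: keep the batch running
            print(f"skipped {path.name}: {exc}")
    return results
```

For example, `texts = extract_folder("invoices", pdfium_get_text)` would collect the text of every PDF in a folder named `invoices` (the folder name is an assumption).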