image-processing python-tesseract pdfminer

Read image-based PDFs with pdfminer page by page


I'm running a script that uses pdfminer to split documents into pages and analyze them on a page-by-page basis. My script goes through the pages like this:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
from pdfminer.pdfdocument import PDFDocument
import pytesseract

fp = open(pdf_path, 'rb')
data = []
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.

for pageNumber, page in enumerate(PDFPage.get_pages(fp)):
    # Read PDF page, write text into stream
    interpreter.process_page(page)
    text = retstr.getvalue()

However, sometimes I get PDFs that are image-based, and my text variable comes back empty. I couldn't find a "convert_image_to_string" function in pdfminer, so I tried an option with pdf2image:

for pageNumber, page in enumerate(PDFPage.get_pages(fp)): #previous code
    # Read PDF page, write text into stream               #previous code
    interpreter.process_page(page)                        #previous code
    text = retstr.getvalue()                              #previous code

    if len(text)<100:                                     #new code
        from pdf2image import convert_from_path           #new code
        img=convert_from_path(page,350)                   #new code
        text=pytesseract.image_to_string(page)            #new code

But pdf2image.convert_from_path needs a file path as input, and my previous code produces a pdfminer page object, so I get TypeError: expected str, bytes or os.PathLike object, not PDFPage. I would very much appreciate a suggestion to:

a) Use pdfminer to convert image pdfs to text or;

b) Use pdfminer to save a PDF page somewhere, so that I can pass its file path to pdf2image.convert_from_path


Solution

  • Well, no one answered, and the workaround I found was to give up on pdfminer and focus on pdf2image and pytesseract (with PyPDF2 to split the pages). Hope it helps someone with the same problem.

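    from pdf2image import convert_from_path
    from PyPDF2 import PdfFileReader, PdfFileWriter  # legacy PyPDF2 (<3.0) names
    import pytesseract

    pdf = PdfFileReader(pdf_path)
    numpages = pdf.getNumPages()
    page_path = "pdfpage.pdf"  # temporary single-page PDF, overwritten for each page

    for pageNumber in range(numpages):
        page = pdf.getPage(pageNumber)
        text = page.extractText()
        if len(text) < 100:
            # Too little embedded text: treat the page as an image and OCR it.
            pdfWriter = PdfFileWriter()
            pdfWriter.addPage(page)
            with open(page_path, 'wb') as f:
                pdfWriter.write(f)

            # Render the saved page at 350 dpi and run Tesseract on it.
            img = convert_from_path(page_path, 350)[0]
            try:
                text = pytesseract.image_to_string(img)
            except Exception:
                text = "no text"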
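
A possible variation, in case it helps: pdf2image's convert_from_path accepts first_page and last_page arguments (both 1-based), so a single page can probably be rendered straight from the original file without writing a temporary one-page PDF. The sketch below is untested against real documents; the helper names ocr_page and text_or_ocr, the min_chars parameter and the injectable ocr argument are my own, not part of any of the libraries.

```python
def ocr_page(pdf_path, page_number, dpi=350):
    """OCR one page of the PDF at pdf_path; page_number is 1-based.

    The imports are local so this module still loads on machines where
    pdf2image / pytesseract are not installed.
    """
    from pdf2image import convert_from_path
    import pytesseract

    # Restrict rendering to a single page of the original file.
    images = convert_from_path(pdf_path, dpi=dpi,
                               first_page=page_number, last_page=page_number)
    return pytesseract.image_to_string(images[0])


def text_or_ocr(text, pdf_path, page_index, min_chars=100, ocr=ocr_page):
    """Return text as-is if it is long enough, otherwise OCR the page.

    page_index is 0-based (as produced by enumerate), while pdf2image
    counts pages from 1, hence the + 1 below.
    """
    if len(text) >= min_chars:
        return text
    return ocr(pdf_path, page_index + 1)
```

Inside the pdfminer loop from the question this would be called as text = text_or_ocr(text, pdf_path, pageNumber). Note that retstr.getvalue() returns everything written so far, so the buffer would need to be cleared between pages (for example retstr.seek(0) followed by retstr.truncate(0)) for the length check to apply to the current page only.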