Tags: python, image, pdf, ocr, pdf2image

Time-efficient way to convert PDF to image


Context:

I have PDF files that I need to run through OCR to extract their text. To do that, I first have to convert the PDF files to images. I currently use the convert_from_path function of the pdf2image module, but it is very slow (9 minutes for a 9-page PDF).

Problem:

I am looking for a way to accelerate this process or another way to convert my PDF files to images.

Additional info:

I am aware that there is a thread_count parameter in the function but after several tries it doesn't seem to make any difference.

This is the whole function I am using:

import os
from pdf2image import convert_from_path

def pdftoimg(fic, output_folder):
    # Render all pages of the PDF (this leaves intermediate .ppm files in output_folder)
    pages = convert_from_path(fic, dpi=500, output_folder=output_folder, thread_count=9, poppler_path=r'C:\Users\Vincent\Documents\PDF\poppler-21.02.0\Library\bin')

    # Save each rendered page as a JPEG
    for image_counter, page in enumerate(pages):
        filename = "page_" + str(image_counter) + ".jpg"
        page.save(os.path.join(output_folder, filename), 'JPEG')

    # Remove the intermediate .ppm files
    for i in os.listdir(output_folder):
        if i.endswith('.ppm'):
            os.remove(os.path.join(output_folder, i))

Link to the convert_from_path reference.


Solution

  • I found a solution to this problem using another module, PyMuPDF, which is a Python binding for MuPDF and is imported under the name fitz.

    First, install PyMuPDF. The documentation covers installation in detail, but for Windows users it's rather simple:

    pip install PyMuPDF
    

    Then import the fitz module:

    import fitz
    print(fitz.__doc__)
    
    >>>PyMuPDF 1.18.13: Python bindings for the MuPDF 1.18.0 library.
    >>>Version date: 2021-05-05 06:32:22.
    >>>Built for Python 3.7 on win32 (64-bit).
    

    Open your file and save every page as an image:

    The get_pixmap() method accepts several parameters that let you control the rendered image (variation, resolution, color, ...), so I suggest that you read its documentation.

    def convert_pdf_to_image(fic):
        # Open the PDF file
        doc = fitz.open(fic)
        # Iterate through the pages and render each one as an RGB image
        for page in doc:
            pix = page.get_pixmap()
            # Save the page as a PNG in the current working directory
            pix.save("page-%i.png" % page.number)
    

    Hope this helps anyone else.
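Since the images are meant for OCR, the default get_pixmap() resolution may be too low. Below is a minimal sketch of how the function above could be extended to render at a chosen resolution by passing a scaling matrix to get_pixmap(). The helper zoom_for_dpi, the output_folder parameter, and the default of 300 DPI are my own assumptions for illustration and are not part of PyMuPDF. The sketch relies on PDF page coordinates being 72 units per inch, so a zoom of dpi / 72 should give roughly the requested resolution.

```python
from pathlib import Path

PDF_BASE_DPI = 72  # PDF page coordinates are expressed in 1/72 inch units


def zoom_for_dpi(dpi):
    """Return the scale factor that renders a page at the requested DPI.

    Hypothetical helper, not part of PyMuPDF.
    """
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return dpi / PDF_BASE_DPI


def convert_pdf_to_images(pdf_path, output_folder, dpi=300):
    """Render every page of pdf_path to a PNG file and return the saved paths.

    The 300 DPI default is an assumption; tune it for your OCR engine.
    """
    import fitz  # PyMuPDF; imported here so zoom_for_dpi works without it

    out_dir = Path(output_folder)
    out_dir.mkdir(parents=True, exist_ok=True)

    zoom = zoom_for_dpi(dpi)
    matrix = fitz.Matrix(zoom, zoom)  # same zoom horizontally and vertically

    saved = []
    doc = fitz.open(pdf_path)
    try:
        for page in doc:
            # alpha=False gives an opaque RGB image without a transparency channel
            pix = page.get_pixmap(matrix=matrix, alpha=False)
            target = out_dir / f"page-{page.number}.png"
            pix.save(str(target))
            saved.append(target)
    finally:
        doc.close()
    return saved
```

Calling convert_pdf_to_images("input.pdf", "pages", dpi=300) would write page-0.png, page-1.png, and so on into the pages folder. Higher DPI values produce larger images and take longer to render, so it is worth trying the lowest resolution your OCR still reads reliably.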