Context:
I have PDF files I'm working with.
I'm using an ocr
to extract the text from these documents and to be able to do that I have to convert my pdf files to images.
I currently use the convert_from_path
function of the pdf2image
module but it is very time inefficient (9minutes for a 9page pdf).
Problem:
I am looking for a way to accelerate this process or another way to convert my PDF files to images.
Additional info:
I am aware that there is a thread_count
parameter in the function but after several tries it doesn't seem to make any difference.
This is the whole function I am using:
def pdftoimg(fic,output_folder):
# Store all the pages of the PDF in a variable
pages = convert_from_path(fic, dpi=500,output_folder=output_folder,thread_count=9, poppler_path=r'C:\Users\Vincent\Documents\PDF\poppler-21.02.0\Library\bin')
image_counter = 0
# Iterate through all the pages stored above
for page in pages:
filename = "page_"+str(image_counter)+".jpg"
page.save(output_folder+filename, 'JPEG')
image_counter = image_counter + 1
for i in os.listdir(output_folder):
if i.endswith('.ppm'):
os.remove(output_folder+i)
Link to the convert_from_path reference.
I found an answer to that problem using another module called fitz
which is a python binding to MuPDF
.
First of all install PyMuPDF:
The documentation can be found here but for windows users it's rather simple:
pip install PyMuPDF
Then import the fitz module:
import fitz
print(fitz.__doc__)
>>>PyMuPDF 1.18.13: Python bindings for the MuPDF 1.18.0 library.
>>>Version date: 2021-05-05 06:32:22.
>>>Built for Python 3.7 on win32 (64-bit).
Open your file and save every page as images:
The get_pixmap() method accepts different parameters that allows you to control the image (variation,resolution,color...) so I suggest that you red the documentation here.
def convert_pdf_to_image(fic):
#open your file
doc = fitz.open(fic)
#iterate through the pages of the document and create a RGB image of the page
for page in doc:
pix = page.get_pixmap()
pix.save("page-%i.png" % page.number)
Hope this helps anyone else.