I have a folder containing 600 PDF files, and each PDF has 20 pages. I need to convert each page into a high-quality PNG as quickly as possible.
I wrote the following script for this task:
import os
import multiprocessing
import fitz  # PyMuPDF
from PIL import Image

def process_pdf(pdf_path, output_folder):
    try:
        pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]
        pdf_output_folder = os.path.join(output_folder, pdf_name)
        os.makedirs(pdf_output_folder, exist_ok=True)

        doc = fitz.open(pdf_path)
        for i, page in enumerate(doc):
            pix = page.get_pixmap(dpi=850)  # Render page at high DPI
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
            img_path = os.path.join(pdf_output_folder, f"page_{i+1}.png")
            img.save(img_path, "PNG")

        print(f"Processed: {pdf_path}")
    except Exception as e:
        print(f"Error processing {pdf_path}: {e}")

def main():
    input_folder = r"E:\Desktop\New folder (5)\New folder (4)"
    output_folder = r"E:\Desktop\New folder (5)\New folder (5)"

    pdf_files = [os.path.join(input_folder, f) for f in os.listdir(input_folder) if f.lower().endswith(".pdf")]

    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        pool.starmap(process_pdf, [(pdf, output_folder) for pdf in pdf_files])

    print("All PDFs processed successfully!")

if __name__ == "__main__":
    main()
Issue:
This script is too slow, especially when processing a large number of PDFs. I tried the following optimizations, but they did not improve speed significantly (a sketch of how they were applied follows the list):
- alpha=False in get_pixmap() – reduced memory usage.
- ThreadPoolExecutor instead of multiprocessing.Pool – no major improvement.
- optimize=False when saving images.
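For reference, this is roughly how those two options slot into the per-page rendering step; render_page is a hypothetical helper, and applying both options together is my assumption of how they were combined:

import fitz  # PyMuPDF
from PIL import Image

def render_page(page, img_path):
    # alpha=False asks PyMuPDF for a pixmap without an alpha channel,
    # so the samples buffer is plain RGB.
    pix = page.get_pixmap(dpi=850, alpha=False)
    img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    # optimize=False skips Pillow's extra PNG compression-optimization pass.
    img.save(img_path, "PNG", optimize=False)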
Possible Solutions I Considered:
- ProcessPoolExecutor instead of ThreadPoolExecutor – since rendering is CPU-intensive, multiprocessing should be better (a sketch follows this section).
What I Need Help With:
Any suggestions would be greatly appreciated!
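A minimal sketch of that ProcessPoolExecutor variant, assuming the same process_pdf function and the same placeholder folder paths as in the script above:

import os
from concurrent.futures import ProcessPoolExecutor

def main():
    input_folder = r"E:\Desktop\New folder (5)\New folder (4)"
    output_folder = r"E:\Desktop\New folder (5)\New folder (5)"
    pdf_files = [
        os.path.join(input_folder, f)
        for f in os.listdir(input_folder)
        if f.lower().endswith(".pdf")
    ]
    # Each worker is a separate process, so CPU-bound rendering is not
    # serialized by the GIL the way it is with ThreadPoolExecutor.
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
        executor.map(process_pdf, pdf_files, [output_folder] * len(pdf_files))

if __name__ == "__main__":
    main()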
Not only is this process highly CPU-intensive, it also requires significant RAM. On macOS (M2), running on just 4 CPUs (i.e., half the number available) improves performance significantly. Even so, the average time to process a page is ~1.3s.
For this test I have 80 PDFs, and a maximum of 20 pages is processed per PDF.
Here's the test:
import fitz
from pathlib import Path
from multiprocessing import Pool
from PIL import Image
from time import monotonic
from os import process_cpu_count

SOURCE_DIR = Path("/Volumes/Spare/Downloads")
TARGET_DIR = Path("/Volumes/Spare/PDFs")

def cpus() -> int:
    if ncpus := process_cpu_count():
        ncpus //= 2
        return ncpus if ncpus > 1 else 2
    return 2

def process(path: Path) -> tuple[float, int]:
    print(f"Processing {path.name}")
    try:
        with fitz.open(path) as pdf:
            start = monotonic()
            for i, page in enumerate(pdf.pages(), 1):
                pix = page.get_pixmap(dpi=850)
                img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                img_path = TARGET_DIR / f"{path.stem}_page_{i}.png"
                img.save(img_path, "PNG")
                if i >= 20:
                    break
            return (monotonic() - start, i)
    except Exception:
        pass
    return (0.0, 0)

def main() -> None:
    TARGET_DIR.mkdir(parents=True, exist_ok=True)
    with Pool(cpus()) as pool:
        sum_d = 0.0
        sum_p = 0
        for duration, page_count in pool.map(process, SOURCE_DIR.glob("*.pdf")):
            sum_d += duration
            sum_p += page_count
    if sum_p > 0:
        print(f"Average duration per page = {sum_d/sum_p:,.4f}s")
    else:
        print("No files were processed")

if __name__ == "__main__":
    main()
Output excluding filenames:
Average duration per page = 1.2667s
Summary:
Rendering at 850 dpi with fitz / PyMuPDF is slow. Reducing the render to, for example, 300 dpi decreased the per-page timing to ~0.17s.
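For comparison, the only change is the dpi argument passed to get_pixmap. A minimal per-page sketch of the 300 dpi variant, with "sample.pdf" and the output directory as placeholders:

import fitz
from pathlib import Path
from PIL import Image

TARGET_DIR = Path("/tmp/pngs")  # placeholder output directory
TARGET_DIR.mkdir(parents=True, exist_ok=True)

with fitz.open("sample.pdf") as pdf:
    for i, page in enumerate(pdf.pages(), 1):
        # 300 dpi produces roughly (850/300)^2 ≈ 8x fewer pixels than 850 dpi,
        # which is broadly consistent with the ~1.27s -> ~0.17s per-page drop.
        pix = page.get_pixmap(dpi=300)
        img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
        img.save(TARGET_DIR / f"sample_page_{i}.png", "PNG")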