Tags: python, pdf, parallel-processing, multiprocessing, pymupdf

Convert multi-page PDF files to PNG quickly


I have a folder containing 600 PDF files, and each PDF has 20 pages. I need to convert each page into a high-quality PNG as quickly as possible.

I wrote the following script for this task:

import os
import multiprocessing
import fitz  # PyMuPDF
from PIL import Image

def process_pdf(pdf_path, output_folder):
    try:
        pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]
        pdf_output_folder = os.path.join(output_folder, pdf_name)
        os.makedirs(pdf_output_folder, exist_ok=True)

        doc = fitz.open(pdf_path)

        for i, page in enumerate(doc):
            pix = page.get_pixmap(dpi=850)  # Render page at high DPI
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
            
            img_path = os.path.join(pdf_output_folder, f"page_{i+1}.png")
            img.save(img_path, "PNG")

        print(f"Processed: {pdf_path}")
    except Exception as e:
        print(f"Error processing {pdf_path}: {e}")

def main():
    input_folder = r"E:\Desktop\New folder (5)\New folder (4)"
    output_folder = r"E:\Desktop\New folder (5)\New folder (5)"

    pdf_files = [os.path.join(input_folder, f) for f in os.listdir(input_folder) if f.lower().endswith(".pdf")]

    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        pool.starmap(process_pdf, [(pdf, output_folder) for pdf in pdf_files])

    print("All PDFs processed successfully!")

if __name__ == "__main__":
    main()

Issue:

This script is too slow, especially when processing a large number of PDFs. The optimizations I tried did not improve speed significantly.

Possible Solutions I Considered:

What I Need Help With:

Any suggestions would be greatly appreciated!


Solution

  • Not only is this process highly CPU intensive, it also requires significant RAM. On macOS (M2), running on just 4 CPUs (i.e., half the number available) improves performance significantly. Even so, the average time to process a page is ~1.3 s. (A note on capping per-worker memory follows at the end of this answer.)

    For this test I have 80 PDFs. A maximum of 20 pages is processed per PDF.

    Here's the test:

    import fitz
    from pathlib import Path
    from multiprocessing import Pool
    from PIL import Image
    from time import monotonic
    from os import process_cpu_count  # Python 3.13+
    
    SOURCE_DIR = Path("/Volumes/Spare/Downloads")
    TARGET_DIR = Path("/Volumes/Spare/PDFs")
    
    def cpus() -> int:
        # Use half the available CPUs (minimum 2): the render is RAM hungry,
        # and a full complement of workers performs worse, not better
        if ncpus := process_cpu_count():
            ncpus //= 2
            return ncpus if ncpus > 1 else 2
        return 2
        
    def process(path: Path) -> tuple[float, int]:
        print(f"Processing {path.name}")
        pages = 0
        try:
            with fitz.open(path) as pdf:
                start = monotonic()
                for i, page in enumerate(pdf.pages(), 1):
                    # Rendering at 850 dpi dominates the per-page cost
                    pix = page.get_pixmap(dpi=850)
                    img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                    img_path = TARGET_DIR / f"{path.stem}_page_{i}.png"
                    img.save(img_path, "PNG")
                    pages = i
                    if pages >= 20:  # cap the test at 20 pages per PDF
                        break
                return (monotonic() - start, pages)
        except Exception as e:
            # Report failures rather than silently swallowing them (the bare
            # "pass" also hid the unbound "i" when a PDF had no pages)
            print(f"Error processing {path.name}: {e}")
        return (0.0, 0)
    
    def main() -> None:
        TARGET_DIR.mkdir(parents=True, exist_ok=True)
        with Pool(cpus()) as pool:
            sum_d = 0.0
            sum_p = 0
            for duration, page_count in pool.map(process, SOURCE_DIR.glob("*.pdf")):
                sum_d += duration
                sum_p += page_count
            if sum_p > 0:
                print(f"Average duration per page = {sum_d/sum_p:,.4f}s")
            else:
                print("No files were processed")
    
    if __name__ == "__main__":
        main()
    

    Output (excluding the per-file "Processing" lines):

    Average duration per page = 1.2667s

    Summary:

    Rendering at 850 dpi with fitz / PyMuPDF is slow. Reducing the render to, for example, 300 dpi decreased the per-page timing to ~0.17 s; a sketch of that variant follows.
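
    If 300 dpi meets your quality bar, there is a further saving: let PyMuPDF write the PNG itself with Pixmap.save() and skip the PIL round-trip entirely. Here is a minimal sketch of that variant under those assumptions, reusing the SOURCE_DIR / TARGET_DIR layout from the test above (DPI is a placeholder for whatever your quality floor allows):

    import fitz
    from pathlib import Path
    from multiprocessing import Pool

    SOURCE_DIR = Path("/Volumes/Spare/Downloads")
    TARGET_DIR = Path("/Volumes/Spare/PDFs")
    DPI = 300  # assumed quality floor; raise it if the output is too soft

    def process(path: Path) -> None:
        try:
            with fitz.open(path) as pdf:
                for i, page in enumerate(pdf.pages(), 1):
                    # Render and write in one step: no intermediate PIL Image
                    pix = page.get_pixmap(dpi=DPI)
                    pix.save(TARGET_DIR / f"{path.stem}_page_{i}.png")
        except Exception as e:
            print(f"Error processing {path.name}: {e}")

    def main() -> None:
        TARGET_DIR.mkdir(parents=True, exist_ok=True)
        with Pool() as pool:
            pool.map(process, SOURCE_DIR.glob("*.pdf"))

    if __name__ == "__main__":
        main()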
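
    On the RAM point above: multiprocessing.Pool takes a maxtasksperchild argument that replaces each worker process after it has handled that many tasks, which caps how much memory any single worker can accumulate over a long run. As a drop-in replacement for main() in the test above (the value 10 is illustrative, not a measured optimum):

    def main() -> None:
        TARGET_DIR.mkdir(parents=True, exist_ok=True)
        sum_d = 0.0
        sum_p = 0
        # Each worker is recycled after 10 PDFs, so memory it has accumulated
        # is returned to the OS at least that often
        with Pool(cpus(), maxtasksperchild=10) as pool:
            for duration, page_count in pool.map(process, SOURCE_DIR.glob("*.pdf")):
                sum_d += duration
                sum_p += page_count
        if sum_p > 0:
            print(f"Average duration per page = {sum_d/sum_p:,.4f}s")
        else:
            print("No files were processed")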