python pdf directory chdir

Extract PDFs from a directory and output images to a different directory with pdf2image


I'm trying to read in some PDFs located in one directory and output images of their pages to a different directory.

(I'm seeking to learn how this code works and I am hoping there's a cleaner way to specify an output directory for my image files.)

What I've done works, but I think it is just bouncing back and forth between my save directory and my pdf directory.

This doesn't feel like a clean approach. Is there a better option, which preserves the existing code and accomplishes what my added lines do?

import os
from pdf2image import convert_from_path

pdf_dir = r"mydirectorypathwithPDFs"
save_dir = 'mydirectorypathforimages'

os.chdir(pdf_dir)

for pdf_file in os.listdir(pdf_dir):
    os.chdir(pdf_dir) #I added this, change back to the pdf directory
    if pdf_file.endswith(".pdf"):
        pages = convert_from_path(pdf_file, 300)
        pdf_file = pdf_file[:-4]
        for page in pages:
            os.chdir(save_dir) #I added this, change to the save directory
            page.save("%s-page%d.jpg" % (pdf_file,pages.index(page)), "JPEG")

The code I slightly modified was created by @photek1944 and found here: https://stackoverflow.com/a/53463015/10216912


Solution

  • This might go a little beyond the scope of exactly what you asked, but anytime someone's looking to streamline code involving os for manipulating paths and files, I always like to recommend Python's pathlib module, because it is awesome. Here's how I personally would implement your program:

    from pathlib import Path
    from pdf2image import convert_from_path
    
    # Use forward slashes here, even if you're on Windows.
    pdf_dir = Path('my/directory/path/with/PDFs')
    save_dir = Path('my/directory/path/for/images')
    
    for pdf_file in pdf_dir.glob('*.pdf'):
        pages = convert_from_path(pdf_file, 300)
        for num, page in enumerate(pages, start=1):
            page.save(save_dir / f'{pdf_file.stem}-page{num}.jpg', 'JPEG')
    

    pathlib automatically provides the right separator (\ on Windows and / almost everywhere else), lets you extend paths with / as an operator, and makes searching a folder particularly convenient with the glob method. It also exposes properties like name (blah.pdf), stem (blah), and suffix (.pdf) to more easily access the parts of the path and file name.
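To make those properties concrete, here is a small sketch you can run without any PDFs on disk. The paths are hypothetical placeholders, and I'm using PurePosixPath (rather than Path) only so the printed separators are the same on every platform:

```python
from pathlib import PurePosixPath

# Hypothetical paths, used only for illustration; nothing is read from disk.
save_dir = PurePosixPath('my/directory/path/for/images')
pdf_file = PurePosixPath('my/directory/path/with/PDFs/blah.pdf')

# name is the full file name, stem drops the extension, suffix is the extension.
parts = (pdf_file.name, pdf_file.stem, pdf_file.suffix)

# The / operator joins path components, so no os.chdir or manual separators are needed.
out_path = save_dir / f'{pdf_file.stem}-page1.jpg'

print(parts)
print(out_path)
```

In the real program you would keep using Path, which picks the right flavor for your operating system.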

    I'm also using an f-string for more readable formatting, and enumerate to track the page numbers. (I've set it to start at 1; I believe your original code would number the first page as 0.)
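If you would rather keep the original os-based structure, one possible minimal change is to build full paths with os.path.join instead of calling os.chdir. This is only a sketch under a few assumptions: page_path and convert_all are hypothetical helper names I made up, the output directory is created if it is missing, and numbering starts at 0 to match your original code. It also swaps pages.index(page), which searches the list again for every page, for enumerate.

```python
import os


def page_path(save_dir, pdf_file, num):
    """Build the output image path for one page (hypothetical helper)."""
    stem = os.path.splitext(pdf_file)[0]  # "report.pdf" -> "report"
    return os.path.join(save_dir, "%s-page%d.jpg" % (stem, num))


def convert_all(pdf_dir, save_dir, dpi=300):
    """Convert every PDF in pdf_dir to JPEGs in save_dir without os.chdir."""
    # Imported here so the path helper above can be used without pdf2image installed.
    from pdf2image import convert_from_path

    # Assumption: create the output directory if it does not exist yet.
    os.makedirs(save_dir, exist_ok=True)

    for pdf_file in os.listdir(pdf_dir):
        if pdf_file.endswith(".pdf"):
            # Pass the full input path, so the current working directory is irrelevant.
            pages = convert_from_path(os.path.join(pdf_dir, pdf_file), dpi=dpi)
            # enumerate starts at 0 here, matching the original numbering.
            for num, page in enumerate(pages):
                page.save(page_path(save_dir, pdf_file, num), "JPEG")
```

You would then call it as convert_all(pdf_dir, save_dir) with your own two directory paths.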