I'm trying to read some PDFs located in one directory and output images of their pages to a different directory.
(I'm seeking to learn how this code works and I am hoping there's a cleaner way to specify an output directory for my image files.)
What I've done works, but I think it is just bouncing back and forth between my save directory and my pdf directory.
This doesn't feel like a clean approach. Is there a better option, which preserves the existing code and accomplishes what my added lines do?
import os
from pdf2image import convert_from_path

pdf_dir = r"mydirectorypathwithPDFs"
save_dir = 'mydirectorypathforimages'

os.chdir(pdf_dir)
for pdf_file in os.listdir(pdf_dir):
    os.chdir(pdf_dir)  # I added this, change back to the pdf directory
    if pdf_file.endswith(".pdf"):
        pages = convert_from_path(pdf_file, 300)
        pdf_file = pdf_file[:-4]
        for page in pages:
            os.chdir(save_dir)  # I added this, change to the save directory
            page.save("%s-page%d.jpg" % (pdf_file, pages.index(page)), "JPEG")
The code I slightly modified was created by @photek1944 and found here: https://stackoverflow.com/a/53463015/10216912
This might go a little beyond the scope of exactly what you asked, but anytime someone's looking to streamline code that uses os for manipulating paths and files, I always like to recommend Python's pathlib module, because it is awesome. Here's how I personally would implement your program:
from pathlib import Path
from pdf2image import convert_from_path

# Use forward slashes here, even if you're on Windows.
pdf_dir = Path('my/directory/path/with/PDFs')
save_dir = Path('my/directory/path/for/images')

for pdf_file in pdf_dir.glob('*.pdf'):
    pages = convert_from_path(pdf_file, 300)
    for num, page in enumerate(pages, start=1):
        page.save(save_dir / f'{pdf_file.stem}-page{num}.jpg', 'JPEG')
pathlib automatically handles providing the right separator (\ on Windows and / mostly everywhere else), it lets you add onto paths with / as an operator, and it makes searching through a folder particularly convenient with the glob method. It also exposes properties like name (blah.pdf), stem (blah), and suffix (.pdf) to more easily access the parts of the path and file name.
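To make those properties concrete, here is a small sketch. The paths are made up for illustration, and it uses pathlib's "pure" path classes so that the output is the same no matter which OS you run it on; in the real program you would just use Path.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical paths, for illustration only.
pdf_file = PurePosixPath('my/pdfs/blah.pdf')
save_dir = PurePosixPath('my/images')

print(pdf_file.name)    # blah.pdf
print(pdf_file.stem)    # blah
print(pdf_file.suffix)  # .pdf

# The / operator appends to a path and returns a new path object.
out_path = save_dir / f'{pdf_file.stem}-page1.jpg'
print(out_path)         # my/images/blah-page1.jpg

# The same forward-slash input is rendered with backslashes for Windows.
win_path = PureWindowsPath('my/images') / 'blah-page1.jpg'
print(win_path)         # my\images\blah-page1.jpg
```

Because the output path is built from save_dir directly, there is no need to change the working directory at all.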
I'm also using an f-string for more readable formatting, and enumerate to track the page numbers. (I've set it to start at 1; I believe your original code would number the first page as 0.)
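Here is a rough comparison of the two numbering approaches. The list of strings is only a stand-in for the page images, and the repeated value is my assumption about what could happen if two pages ever compare equal (two blank pages, say): list.index returns the position of the first equal item, so both pages would get the same number.

```python
# Stand-ins for page images; the repeated value mimics two pages that compare equal.
pages = ['cover', 'blank', 'blank']

# Original approach: look each page up in the list (0-based, first match wins).
by_index = [pages.index(page) for page in pages]
print(by_index)      # [0, 1, 1]

# enumerate approach: a running counter, here starting at 1.
by_enumerate = [num for num, page in enumerate(pages, start=1)]
print(by_enumerate)  # [1, 2, 3]

# File names built the same way as in the program above ('blah' is a made-up stem).
names = [f'blah-page{num}.jpg' for num in by_enumerate]
print(names)         # ['blah-page1.jpg', 'blah-page2.jpg', 'blah-page3.jpg']
```

With enumerate every page gets its own number regardless of its content, and it avoids searching the list again for each page.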