How can I efficiently save a particular page of a PDF as a JPEG file using Python?
I have a Python Flask web server where PDFs will be uploaded, and I also want to store
JPEG files that correspond to each PDF page.
This solution is close, but it does not result in the entire page being converted to a JPEG.
The pdf2image library can be used.
You can install it with:
pip install pdf2image
Once it is installed, you can use the following code to get the images:
from pdf2image import convert_from_path
pages = convert_from_path('pdf_file', 500)
Save the pages in JPEG format:
for count, page in enumerate(pages):
    page.save(f'out{count}.jpg', 'JPEG')
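Since the question asks about a particular page, rendering the whole document first is more work than needed. The following is a minimal sketch, assuming pdf2image's `first_page` and `last_page` keyword arguments (both 1-based) are used to render only the requested page. The function names, the output naming scheme, and the default DPI of 200 are illustrative assumptions, not part of the original answer.

```python
from pathlib import Path


def jpeg_name(pdf_path, page_number):
    """Build an output name such as 'report_page3.jpg' (hypothetical naming scheme)."""
    return f"{Path(pdf_path).stem}_page{page_number}.jpg"


def save_pdf_page_as_jpeg(pdf_path, page_number, out_dir=".", dpi=200):
    """Render one page (1-based) of a PDF and save it as a JPEG; return the output path."""
    # Imported here so this module still loads on machines without pdf2image.
    from pdf2image import convert_from_path

    # first_page/last_page restrict rendering to the single requested page.
    pages = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    if not pages:
        raise ValueError(f"page {page_number} could not be rendered from {pdf_path}")

    out_path = Path(out_dir) / jpeg_name(pdf_path, page_number)
    pages[0].save(out_path, "JPEG")
    return out_path
```

For example, `save_pdf_page_as_jpeg('pdf_file', 3)` would write the third page next to the script. A lower DPI than 500 keeps memory use and file size down, which may matter on a web server.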
Edit: the pdf2image GitHub repo also mentions that it uses pdftoppm
and that it requires other installations:
pdftoppm is the piece of software that does the actual magic. It is distributed as part of a greater package called poppler. Windows users will have to install poppler for Windows (see ** below). Mac users will have to install poppler for Mac. Linux users will have pdftoppm pre-installed with the distro (tested on Ubuntu and Archlinux); if it's not, run
sudo apt install poppler-utils
On Windows, you can install the latest version using Anaconda:
conda install -c conda-forge poppler
** Note: 64-bit Windows versions up to 24.08 are available at https://github.com/oschwartz10612/poppler-windows. For 32-bit Windows, 22.02 was the last one included in TeXLive 2022 (https://poppler.freedesktop.org/releases.html), so you will not be getting the latest features or bug fixes.
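Because pdf2image depends on the pdftoppm executable, it can help to check for it before handling uploads. This is a small sketch using only the standard library; the function name is hypothetical, and it assumes pdftoppm is expected to be on the PATH.

```python
import shutil


def pdftoppm_available():
    """Return True if a pdftoppm executable can be found on PATH."""
    return shutil.which("pdftoppm") is not None


if not pdftoppm_available():
    print("pdftoppm not found on PATH; install poppler before using pdf2image")
```

If poppler is installed somewhere outside the PATH (common on Windows), pdf2image's `poppler_path` argument can point `convert_from_path` at the directory containing the binaries instead.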