pdf2image

pdf2image: how to remove the '0001' in jpg file names?


My goal is to convert a multi-page PDF file into a number of .jpg files, in such a way that the images are written directly to the hard disk/SSD instead of being held in memory.

In Python 3.11:

from pdf2image import convert_from_path

poppler_path = r".\poppler-22.12.0\Library\bin"

# fmt="jpeg" is needed for .jpg output; pdf2image's default format is ppm
images = convert_from_path('test.pdf', output_folder='.', output_file='test',
                           fmt='jpeg', poppler_path=poppler_path, paths_only=True)

pdf2image generates files with names such as 'test_0001-1.jpg', 'test_0001-2.jpg', etc.

Problem: I would like the file names not to contain the '_0001-' part (e.g. 'test1.jpg').

The only way so far seems to be to use convert_from_path WITHOUT output_folder and then save each image with image.save. But this way the images are first held in memory, which can easily add up to many megabytes.

Is it possible to change the way pdf2image generates the file names when saving images directly to files?


Solution

  • Just use the poppler utilities directly (or xpdf pdftopng), i.e. simply call pdftoppm via a shell. Add other options such as -r 200 as desired for resolutions other than the default 150.

    I recommend PNG for better image fidelity. If you want .jpg files, replace "-png" below with "-jpeg"; the direct answer to the question as asked would be pdftoppm -jpeg -f 1 -l 9 -sep "" test.pdf "test". Do follow the enhancement below for file sorting, though: Windows file sorting needs leading zeros, otherwise the order in a zip or folder is 1, 10, 11, ..., 2, 20, ..., which is often undesirable.

    "path to bin\pdftoppm" -png "path to \in.pdf" "name"

    The options for adding digits are limited compared with other apps, so if you want "name-01.png" you need to output only pages 1-9, as

    \bin>pdftoppm -png -f 1 -l 9 -sep "0" in.pdf "name-"

    Then, for pages 10 onward (say, for a file of up to 99 pages), use the default separator; it will only output the page numbers that are actually available:

    \bin>pdftoppm -png -f 10 -l 99 in.pdf "name"

    Thus, for a 12-page file this produces only -10, -11 and -12, as required.

    Likewise, for up to 9999 pages you need 4 calls. If you don't want the "-", simply delete it. For a different output directory, adjust the output name accordingly.

    rem %~dpn1 = drive, path and base name of the first argument, without the extension
    set "name=%~dpn1"
    set "bin=path to Poppler\Release-22.12.0-0\poppler-22.12.0\Library\bin"
    
    rem One call per digit count; the prefix plus -sep supply the leading zeros
    "%bin%\pdftoppm" -png -r 200 -f 1 -l 9 -sep "0" "%name%.pdf" "%name%-00"
    "%bin%\pdftoppm" -png -r 200 -f 10 -l 99 -sep "0" "%name%.pdf" "%name%-0"
    "%bin%\pdftoppm" -png -r 200 -f 100 -l 999 -sep "0" "%name%.pdf" "%name%-"
    "%bin%\pdftoppm" -png -r 200 -f 1000 -l 9999 -sep "" "%name%.pdf" "%name%-"
    

    In the 12-page example above, the worst case is that the last two calls reply with
    "Wrong page range given: the first page (100) can not be after the last page (12)." and the same for 1000. Those warnings can be ignored.

    Those 4 lines could go in a Windows batch file (or an equivalent script on another OS) that accepts arguments, for use via SendTo or drag and drop. Then simply call it from the system or from Python as pdf2png.bat input.pdf for each file; in that simple case the output lands in the same directory as the input.
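If you would rather drive pdftoppm from Python than from a batch file, the same one-call-per-digit-count idea could be sketched roughly as below. This is only an illustration, not part of Poppler or pdf2image: the helper names (padding_bands, pdftoppm_commands, convert), the width and dpi parameters, and the choice to put all the leading zeros in the output prefix are my own assumptions. It relies on -sep "" being accepted, as used above, and on pdftoppm adding no zero padding of its own. Ranges are clipped to the real page count, so the "Wrong page range" messages should not appear.

```python
import subprocess
from pathlib import Path


def padding_bands(num_pages, width=4):
    """Return (first_page, last_page, zeros) for each digit count.

    `zeros` is the run of literal zeros to place before the page number
    so that every number ends up `width` characters wide. Bands that
    start beyond `num_pages` are skipped; pages needing more than
    `width` digits are not covered.
    """
    bands = []
    for digits in range(1, width + 1):
        first = 10 ** (digits - 1)               # 1, 10, 100, ...
        if first > num_pages:
            break
        last = min(10 ** digits - 1, num_pages)  # 9, 99, 999, ... clipped
        bands.append((first, last, "0" * (width - digits)))
    return bands


def pdftoppm_commands(pdf_path, num_pages, pdftoppm="pdftoppm", width=4, dpi=200):
    """Build one pdftoppm argument list per band (nothing is executed here)."""
    pdf_path = Path(pdf_path)
    root = pdf_path.with_suffix("")  # output root: same folder and base name as the PDF
    commands = []
    for first, last, zeros in padding_bands(num_pages, width):
        commands.append([
            pdftoppm, "-png", "-r", str(dpi),
            "-f", str(first), "-l", str(last),
            "-sep", "",                  # empty separator, as in the last batch line
            str(pdf_path), f"{root}-{zeros}",
        ])
    return commands


def convert(pdf_path, num_pages, **kwargs):
    """Run the calls one after another; pdftoppm writes the images straight to disk."""
    for cmd in pdftoppm_commands(pdf_path, num_pages, **kwargs):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Show the calls that would be made for a 12-page file without running them.
    for cmd in pdftoppm_commands("in.pdf", 12):
        print(cmd)
```

For real use you would pass the full path to pdftoppm.exe as the pdftoppm argument and then call convert(). The page count has to come from somewhere; pdf2image's pdfinfo_from_path("in.pdf")["Pages"] should be one way to get it. Swap "-png" for "-jpeg" if you want .jpg files.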