I have a large collection of documents scanned into PDF format, and I wish to write a shell script that will convert each document to DjVu format. Some documents were scanned at 200dpi, some at 300dpi, and some at 600dpi. Since DjVu is a pixel-based format, I want to be sure I use the same resolution in the target DjVu file as was used for the scan.
Does anyone know what program I can run, or how I can write a program, to determine what resolution was used to produce a scanned PDF? (Number of pixels might work too as almost all documents are 8.5 by 11 inches.)
Clarification after responses: I'm aware of the difficulties highlighted by Breton, and I'm willing to concede that the problem in general is ill-posed, but I'm not asking about general PDF documents. My particular documents came out of a scanner. They contain one scanned image per page, same resolution each page. If I convert the PDF to PostScript I can poke around by hand and find pixel dimensions easily; I could probably find image sizes with more work. And if in desperate need I could modify the dictionary stack that gs is using; long ago, I wrote an interpreter for PostScript Level 1.
All of that is what I'm trying to avoid.
Thanks to help received, I've posted an answer below:
1. Find the page size using identify, taking only the output for the first page, and understanding that the units will be PostScript points, of which there are 72 to an inch.
2. Extract the images using pdfimages.
3. Running identify on each extracted image will give the number of pixels.
Full answer with script is below. I'm using it in live fire and it works great. Thanks Harlequin for pdfimages and Spiffeah for the alert about multiple images per page (it's rare, but I've found some).
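As a rough sketch of the arithmetic in the summary above (not the full script): once you have an image dimension in pixels and the matching page dimension in points, the resolution is pixels times 72 divided by points. The function name dpi_from_sizes is made up for illustration, and it assumes both inputs are whole numbers, since POSIX shell arithmetic is integer-only.

```shell
#!/bin/sh
# dpi_from_sizes PIXELS POINTS   (hypothetical helper, for illustration only)
# Prints the resolution in dots per inch, given an image dimension in pixels
# and the matching page dimension in PostScript points (72 points = 1 inch).
# Assumes both arguments are positive integers.
dpi_from_sizes() {
    # dpi = pixels * 72 / points, rounded to the nearest integer
    echo $(( ($1 * 72 + $2 / 2) / $2 ))
}
```

For a letter-size page (612 by 792 points), an image 1700 pixels wide comes out as 200 dpi, and one 2550 pixels wide as 300 dpi.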
I guess that the scans are included as images in the PDF, so you could use pdfimages to extract them first. Then, identify should be able to find the correct data.
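Something along these lines might do it. This is only a sketch: the function name first_page_image_sizes is invented here, it assumes pdfimages (from xpdf or poppler) and ImageMagick's identify are on your PATH, and it relies on mktemp -d, which is common but not strictly POSIX.

```shell
#!/bin/sh
# first_page_image_sizes FILE.pdf   (hypothetical helper, for illustration only)
# Prints one "WIDTH HEIGHT" line, in pixels, for each image embedded in the
# first page of the given PDF.
first_page_image_sizes() {
    pdf=$1
    tmpdir=$(mktemp -d) || return 1

    # -f 1 -l 1 restricts extraction to page 1; the extracted images are
    # written as files whose names start with "$tmpdir/img-".
    if ! pdfimages -f 1 -l 1 "$pdf" "$tmpdir/img"; then
        rm -rf "$tmpdir"
        return 1
    fi

    for img in "$tmpdir"/img-*; do
        [ -e "$img" ] || continue      # glob matched nothing: no images found
        identify -format '%w %h\n' "$img"
    done

    rm -rf "$tmpdir"
}
```

Calling first_page_image_sizes scan.pdf should print one line per image; if a page turns out to hold more than one image you will see several lines and have to decide which one to trust.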