php pdf tesseract ghostscript xpdf

How to differentiate between "text" PDFs and "image" PDFs in PHP?


I've recently set up a Linux server to be able to convert text-based PDFs to text by using the pdftotext command that's part of Xpdf as well as to convert image-based PDFs to text by using a combination of the gs (Ghostscript) and tesseract commands.

Both solutions work pretty well when I already know whether a PDF is text-based or image-based. However, in order to automate the process of converting many PDFs to text, I need to be able to tell whether a PDF is text-based or image-based so that I know which set of processes to run on the PDF.

Is there any way in PHP to analyze a PDF and tell whether it's text-based or image-based so that I know whether to use Xpdf or Ghostscript/Tesseract on it?


Solution

  • I think Kurt Pfeifle's answer to this question is superb: use pdffonts, which is also part of Xpdf / Poppler, to list the fonts a PDF uses.

    If pdffonts lists any font, the PDF contains text. If it lists none, the PDF contains only images.
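    As a rough sketch of how this could be wired into PHP: the snippet below runs pdffonts through exec() and counts the font rows in its output. It assumes pdffonts is on the PATH and that its output starts with two header lines (the column names and a dashed separator), followed by one line per font. The function names countFontRows and pdfHasFonts are made up for illustration.

```php
<?php

/**
 * Count the font rows in the output of pdffonts.
 * Assumption: the first two lines are the column header and the
 * dashed separator; every non-empty line after that is one font.
 */
function countFontRows(array $lines): int
{
    $count = 0;
    foreach (array_slice($lines, 2) as $line) {
        if (trim($line) !== '') {
            $count++;
        }
    }
    return $count;
}

/**
 * Return true if pdffonts reports at least one font for the file.
 * Throws if pdffonts is missing or cannot read the PDF.
 */
function pdfHasFonts(string $pdfPath): bool
{
    $output = [];
    $exitCode = 0;
    // escapeshellarg() protects against spaces and shell metacharacters.
    exec('pdffonts ' . escapeshellarg($pdfPath) . ' 2>/dev/null', $output, $exitCode);

    if ($exitCode !== 0) {
        throw new RuntimeException("pdffonts failed for {$pdfPath} (exit code {$exitCode})");
    }
    return countFontRows($output) > 0;
}
```

    With that in place, something like `pdfHasFonts($file) ? 'pdftotext' : 'gs + tesseract'` picks the pipeline. One caveat to keep in mind: a scanned PDF that has already been OCR'd usually carries a hidden text layer with a font, so it will be classified as text-based, which is probably what you want anyway.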