macos pdf ocr abbyy

Is it possible to remove a PDF's images and keep only the OCR'd text?


I scanned a book and OCR'd it using ABBYY, but all I really care about is the text from the OCR and that it stays organized by page. Is there a tool I could use to drop all of the scanned page images but keep all of the OCR text? I realize it wouldn't be human readable at that point, but that's OK because I'm processing the PDF with Python scripts.


Solution

  • @johnwhitington's comment on the question worked great for me, but it's not a complete answer.

    cpdf -draft in.pdf -o out.pdf

    You can get cpdf from https://github.com/coherentgraphics/cpdf-binaries

    The -draft option removes images:

      -draft Remove images from the file
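
Since the question mentions Python scripts, here is a minimal sketch of calling the same command from Python. It assumes the cpdf binary is on your PATH; the function names `draft_command` and `strip_images` are made up for this example and are not part of cpdf.

```python
import shutil
import subprocess
from pathlib import Path


def draft_command(src, dst, cpdf="cpdf"):
    """Build the argument list for: cpdf -draft <src> -o <dst>."""
    return [cpdf, "-draft", str(src), "-o", str(dst)]


def strip_images(src, dst, cpdf="cpdf"):
    """Run cpdf -draft on src and write the image-free PDF to dst.

    Assumes cpdf is installed and on PATH (hypothetical setup).
    Raises subprocess.CalledProcessError if cpdf exits with an error.
    """
    if shutil.which(cpdf) is None:
        raise FileNotFoundError(f"{cpdf!r} was not found on PATH")
    subprocess.run(draft_command(src, dst, cpdf), check=True)
    return Path(dst)
```

Calling `strip_images("in.pdf", "out.pdf")` should be equivalent to the command line above.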
    

    You need to make sure you actually have text in the file first, of course. With Acrobat, that's the "editable text and images" option in the OCR settings. If you can copy a block of text, paste it outside the PDF, and get readable text, you probably have a PDF that works for this.
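
The copy-and-paste check can also be approximated in a script. This is only a sketch: it assumes the third-party pypdf package is installed (it is not mentioned anywhere above), and `extract_page_texts` and `pages_missing_text` are hypothetical helper names. It lists the pages on which no text could be extracted, which would suggest the OCR layer is missing there.

```python
def pages_missing_text(page_texts, min_chars=1):
    """Return 1-based page numbers whose text has fewer than min_chars characters."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


def extract_page_texts(pdf_path):
    """Return one string per page, in page order.

    Uses pypdf, which is an assumption of this sketch; any library
    that gives per-page text would do.
    """
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

For example, `pages_missing_text(extract_page_texts("in.pdf"))` returning an empty list means every page yielded at least some text. The same per-page list is also a convenient way to keep the text organized by page.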

    The output of -draft is perfectly human-readable (minus any supporting graphics, obviously).

    Further information and documentation on the cpdf tool can be found at:

    https://www.coherentpdf.com
    https://www.coherentpdf.com/cpdfmanual.pdf

    You may find a combination of -draft AND -blacktext useful (I did).
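
One conservative way to combine the two options from Python is to run cpdf twice with an intermediate file, as sketched below. It assumes cpdf is on your PATH; the function names are hypothetical. I believe the cpdf manual also describes chaining operations with AND in a single invocation, but the two-pass form avoids depending on that syntax.

```python
import os
import subprocess
import tempfile


def draft_blacktext_commands(src, tmp, dst, cpdf="cpdf"):
    """Build the two cpdf invocations: remove images, then blacken text."""
    return [
        [cpdf, "-draft", str(src), "-o", str(tmp)],
        [cpdf, "-blacktext", str(tmp), "-o", str(dst)],
    ]


def strip_images_and_blacken(src, dst, cpdf="cpdf"):
    """Run both passes, cleaning up the intermediate file afterwards.

    Assumes cpdf is installed and on PATH (hypothetical setup).
    """
    fd, tmp = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)  # cpdf writes the file itself; only the path is needed
    try:
        for command in draft_blacktext_commands(src, tmp, dst, cpdf):
            subprocess.run(command, check=True)
    finally:
        os.remove(tmp)
```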