I scanned a book and OCR'd it using ABBYY, but all I really care about is the text from the OCR and that it's organized by page. Is there a tool I could use to drop all of the scanned page images but keep all of the OCR text? I realize it wouldn't be human readable at that point, but that's ok because I'm processing the PDF with Python scripts.
@johnwhitington's comment on the question worked great for me, but it's not a complete answer.
cpdf -draft in.pdf -o out.pdf
You can get cpdf from https://github.com/coherentgraphics/cpdf-binaries
The -draft option removes images:
-draft Remove images from the file
You need to make sure you actually have text in the file first, of course. With Acrobat, that's the "editable text and images" option in the OCR settings. If you can copy a block of text, paste it somewhere outside the PDF, and get readable text, you might have a PDF that works for this.
This produces a perfectly human readable result (minus any supporting graphics, obviously).
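Since the question mentions Python scripts, here is a minimal sketch of calling the same cpdf -draft command from Python. The helper names build_draft_command and strip_images are made up for illustration, and the sketch assumes the cpdf binary is installed and on your PATH; only the -draft and -o flags shown above are used.

```python
import shutil
import subprocess
from pathlib import Path


def build_draft_command(src, dst, cpdf="cpdf"):
    """Return the argument list for: cpdf -draft <src> -o <dst>."""
    # Hypothetical helper; it only assembles the command shown in this answer.
    return [cpdf, "-draft", str(src), "-o", str(dst)]


def strip_images(src, dst, cpdf="cpdf"):
    """Run cpdf -draft on src and write the image-free copy to dst."""
    # Assumption: cpdf is reachable by this name or path.
    if shutil.which(cpdf) is None:
        raise FileNotFoundError(f"{cpdf!r} was not found on PATH")
    # check=True raises CalledProcessError if cpdf exits with an error.
    subprocess.run(build_draft_command(src, dst, cpdf), check=True)
    return Path(dst)
```

Calling strip_images("in.pdf", "out.pdf") should then be equivalent to running the command by hand, and the result can be passed on to whatever script extracts the text.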
Further information and documentation on the cpdf tool can be found at:
https://www.coherentpdf.com
https://www.coherentpdf.com/cpdfmanual.pdf
You may find a combination of -draft AND -blacktext useful (I did).