[SOLVED] Tesseract - Preprocessing that Doesn't Affect Final Image

I'm using the latest version of Tesseract (5.0), and I'm trying to determine whether or not I can insert some preprocessing steps that will -not- affect the form of the final image.

For example, I might start out with an image such as this.

There are different levels of shadow/brightness, so I might use adaptive Gaussian thresholding to avoid shadows during binarization.

I will now run this through Tesseract, with the hope of creating an OCR'd PDF in the end. However, I want the image that the end user (and I) see to be the full-color, original image, with the text recognized from the transformed image underlaid as a searchable layer.

Is there a way to manage this? Or am I completely missing the point here?

Solution

I was provided an answer on another forum, and wanted to share it here.

Instead of using the built-in PDF output in Tesseract, I used the hOCR output. My pipeline went:

Preprocess the image (thresholding, etc.).
Run Tesseract on the preprocessed image with the following command: tesseract example1.jpg example1 -l eng hocr
Use the hocr-pdf tool from Ocropus (part of its hocr-tools package) to merge the hOCR output with the ORIGINAL image, with no preprocessing applied.
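The steps above can be sketched as a small Python wrapper. This is only a sketch under some assumptions: `tesseract` and `hocr-pdf` are on the PATH, the original is a JPEG, and `hocr-pdf` is pointed at a directory holding the image and the `.hocr` file under the same base name. The helper names and the `page` stem are hypothetical. The preprocessed image should keep the same pixel dimensions as the original so the word boxes line up.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def tesseract_hocr_cmd(preprocessed, out_base, lang="eng"):
    """Build the same command as in step 2; Tesseract writes <out_base>.hocr."""
    return ["tesseract", str(preprocessed), str(out_base), "-l", lang, "hocr"]


def stage_for_hocr_pdf(original_jpg, hocr_file, workdir, stem="page"):
    """Copy the ORIGINAL image and the hOCR file side by side with one base name."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    shutil.copy(original_jpg, workdir / f"{stem}.jpg")
    shutil.copy(hocr_file, workdir / f"{stem}.hocr")
    return workdir


def make_searchable_pdf(original_jpg, preprocessed, pdf_out, lang="eng"):
    """OCR the preprocessed image, then pair the text with the original image."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        out_base = tmp / "ocr"
        subprocess.run(tesseract_hocr_cmd(preprocessed, out_base, lang), check=True)
        stage = stage_for_hocr_pdf(original_jpg, out_base.with_suffix(".hocr"), tmp / "stage")
        # hocr-pdf writes the PDF to standard output, so redirect it to a file.
        with open(pdf_out, "wb") as fh:
            subprocess.run(["hocr-pdf", str(stage)], stdout=fh, check=True)
```

A call such as `make_searchable_pdf("example1.jpg", "example1_bin.png", "example1.pdf")` would then produce a PDF showing the original image with the recognized text behind it; the file names here are placeholders.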