opencv ocr tesseract image-thresholding

Tesseract - Preprocessing that Doesn't Affect Final Image


I'm using the latest version of Tesseract (5.0), and I'm trying to determine whether or not I can insert some preprocessing steps that will -not- affect the form of the final image.

For example, I might start out with an image such as this.

There are different levels of shadow/brightness, so I might use adaptive Gaussian thresholding to avoid shadows during binarization.

I will now run this through Tesseract, with the hope of creating an OCR'd PDF in the end. However, I want the image that the end user (and I) see to be the full-color, original image, with the text from the transformed image underlaid.

Is there a way to manage this? Or am I completely missing the point here?


Solution

  • I was provided an answer on another forum, and wanted to share it here.

    Instead of using the built-in PDF output in Tesseract, I used the hOCR output. My pipeline went:

    1. Preprocess the image (thresholding, etc.).
    2. Run Tesseract on the preprocessed image with the following command: tesseract example1.jpg example1 -l eng hocr
    3. Use the hocr-pdf tool from hocr-tools (part of the OCRopus project) to merge the hOCR output with the ORIGINAL image, with no preprocessing applied.
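The three steps above could be scripted roughly as follows. This is only a sketch under a few assumptions: `tesseract` and `hocr-pdf` are on the PATH, `hocr-pdf` is given a directory holding a JPEG and an hOCR file with the same base name and writes the PDF to standard output, and the preprocessed image has the same pixel dimensions as the original (hOCR bounding boxes are pixel coordinates, so a resized image would shift the text layer). The function names and paths are made up for the example.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_hocr_cmd(preprocessed: Path, out_base: Path, lang: str = "eng") -> list:
    """Build the step 2 command; Tesseract writes <out_base>.hocr."""
    return ["tesseract", str(preprocessed), str(out_base), "-l", lang, "hocr"]


def hocr_pdf_cmd(work_dir: Path) -> list:
    """Build the step 3 command; hocr-pdf pairs images and hOCR files by base name."""
    return ["hocr-pdf", str(work_dir)]


def build_searchable_pdf(original: Path, preprocessed: Path,
                         work_dir: Path, pdf_out: Path) -> None:
    """OCR the preprocessed image, then pair the hOCR with the original image."""
    work_dir.mkdir(parents=True, exist_ok=True)
    base = work_dir / original.stem

    # Step 2: recognize text on the preprocessed (thresholded) image.
    subprocess.run(tesseract_hocr_cmd(preprocessed, base), check=True)

    # Step 3: put the ORIGINAL image next to the hOCR file under the same
    # base name, so the PDF shows the original with the text layer beneath it.
    shutil.copyfile(original, work_dir / (original.stem + ".jpg"))
    with pdf_out.open("wb") as fh:
        subprocess.run(hocr_pdf_cmd(work_dir), stdout=fh, check=True)
```

A call such as `build_searchable_pdf(Path("original/example1.jpg"), Path("preprocessed/example1.jpg"), Path("work"), Path("example1.pdf"))` would then produce the PDF; keeping the working directory separate avoids overwriting either input image.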