I'm using the latest version of Tesseract (5.0), and I'm trying to determine whether or not I can insert some preprocessing steps that will -not- affect the form of the final image.
For example, I might start out with an image such as this.
There are different levels of shadow/brightness, so I might use adaptive Gaussian thresholding to avoid shadows during binarization.
I will now run this through tesseract, with the hope of creating an OCR'd PDF in the end. However, I want the image that the end user (and I) see to be the full-color, original image, with the text from the transformed image underlaid
Is there a way to manage this? Or am I completely missing the point here.
I was provided an answer on another forum, and wanted to share it here.
Instead of using the built in PDF option in Tesseract, I used the hOCR setting. My pipeline went: