I've build an application that uses Tesseract (V3.03 rc1) to identify some specific text strings. These are, unfortunately, printed on a custom font that requires that I build my own traineddata file. I've built the application on both iOS (using https://github.com/gali8/Tesseract-OCR-iOS for inspiration) and Android (using https://github.com/rmtheis/tess-two/ for inspiration as well).
The workflow for both platforms is as follows:
I select a bounding box on the preview screen for where I can crop out the relevant text, and crop the image accordingly.
I use OpenCV to get a binary image (using OpenCV's adaptive threshold function with the same parameters for both platforms)
I pass this binary image to Tesseract. Both platforms (Android and iOS) use the same traineddata file.
And yet, iOS recognizes the text strings perfectly, while Android keeps misidentifying certain characters (6s for Ss, As for Hs).
On both platforms, I use the same white list string, I disable load_type_dawg and load_system_dawg, and also choose to save the blob choices.
Has anyone encountered this kind of situation before? Am I missing a setting on Android that's automatically handled in iOS? Is there something particular about Android that hasn't crossed my mind?
Any thoughts or advice would be greatly appreciated!
So, after a lot of work, I found out what was wrong with my Android application (thankfully, it wasn't an issue with Tesseract at all). As I'm more familiar with iOS apps than Android, I wasn't sure how I could load the traineddata file onto the application without requiring the user to have the file loaded on their external storage device. I found inspiration in this project (http://www.codeproject.com/Tips/840623/Android-Character-Recognition), as they autoload the trained data file.
However, I misunderstood how it worked. I originally thought that the TessDataManager did a file lookup on the project's local tesseract/tessdata folder in order to get the trained data file (as I do this also on iOS). However, that's not what it does. It, rather, checks the internal file structure (data/data/projectname/files/tesseract/tessdata/traineddatafilegoeshere) to see if the file exists and if it doesn't, it copies over the trained data file it keeps in the Resources/Raw directory. In my case, it defaulted to the eng file, so it never read my custom font file.
Hopefully this helps someone else having similar issues. Thanks to Robin and RmTheis for all of your help!