I know Tika supports Tesseract OCR. I want to use my own custom OCR library with Tika instead of Tesseract. How can I achieve this?
I also want the custom library to be used for PDF parsing, following Tika's auto OCR strategy.
I came across this wiki, which describes how to use Tesseract with Tika, but I found no instructions on how to replace it with a custom OCR library of our choice.
(I am currently using Tika as a JAR and not as Tika Server, in case that is relevant.)
The solution is to create a custom OCR parser that takes the place of TesseractOCRParser, since OCR parser selection follows the same mechanism used to select other parsers.
All I had to do was make it return, from its getSupportedTypes() method, the same OCR types that TesseractOCRParser supports. Each of these is the actual MIME subtype with "ocr-" prepended, e.g. "image/ocr-png".
These are not standard MIME types but Tika-specific types. While parsing an image, Tika checks dynamically whether a parser supporting the OCR type is available, and keeps the standard non-OCR parsers as a fallback. This means you don't need to (and, for various reasons, shouldn't) declare the standard MIME types as the supported types of an OCR parser; declare the OCR types instead.
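To make the naming rule concrete, here is a minimal plain-Java sketch of the "ocr-" convention described above. The class and method names (OcrTypes, toOcrType) are made up for illustration and are not part of Tika.

```java
/** Illustrative helper (not part of Tika): derives the OCR type name from a standard MIME type. */
public final class OcrTypes {
    private OcrTypes() {}

    /** Prepends "ocr-" to the subtype, e.g. "image/png" becomes "image/ocr-png". */
    public static String toOcrType(String mimeType) {
        int slash = mimeType.indexOf('/');
        if (slash <= 0 || slash == mimeType.length() - 1) {
            throw new IllegalArgumentException("Not a type/subtype pair: " + mimeType);
        }
        // Keep "type/" as-is and prefix only the subtype.
        return mimeType.substring(0, slash + 1) + "ocr-" + mimeType.substring(slash + 1);
    }
}
```

For example, OcrTypes.toOcrType("image/jpeg") gives "image/ocr-jpeg", which is the kind of type the custom parser should declare.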
Then override the parse method to hold the custom OCR logic.
Creating a new AutoDetectParser with this custom parser in its registry does the rest.
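A rough sketch of such a parser, assuming the Tika 1.x/2.x Parser interface and tika-core on the classpath. CustomOcrParser, runCustomOcr and newAutoDetectParser are hypothetical names, the list of OCR types is only a subset chosen for illustration, and the OCR call itself is a placeholder for your own library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.DefaultParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

/** Hypothetical OCR parser that advertises Tika's "ocr-" image types. */
public class CustomOcrParser implements Parser {
    private static final long serialVersionUID = 1L;

    // Assumed subset of the OCR types; check TesseractOCRParser in your Tika version for the full list.
    private static final Set<MediaType> SUPPORTED_TYPES;
    static {
        Set<MediaType> types = new HashSet<>();
        types.add(MediaType.image("ocr-png"));
        types.add(MediaType.image("ocr-jpeg"));
        types.add(MediaType.image("ocr-tiff"));
        SUPPORTED_TYPES = Collections.unmodifiableSet(types);
    }

    @Override
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return SUPPORTED_TYPES;
    }

    @Override
    public void parse(InputStream stream, ContentHandler handler,
                      Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        String text = runCustomOcr(stream);
        // Emit the recognized text as XHTML so downstream handlers can consume it.
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        xhtml.element("p", text);
        xhtml.endDocument();
    }

    /** Placeholder: replace with a call into your own OCR library. */
    protected String runCustomOcr(InputStream image) throws IOException {
        return "OCR text placeholder";
    }

    /** Default parsers first, custom parser last, so it wins for the types it declares. */
    public static AutoDetectParser newAutoDetectParser() {
        return new AutoDetectParser(new DefaultParser(), new CustomOcrParser());
    }
}
```

When parsing, it is probably worth also putting the resulting parser into the ParseContext (context.set(Parser.class, parser)), which is the usual way to let Tika handle embedded content with it. Whether the image and PDF parsers pick up the OCR parser that way is something to verify on your Tika version.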
If TesseractOCRParser competes with it, either uninstall Tesseract from the host, in which case that parser's getSupportedTypes() method returns only an empty set, or exclude the parser in the Tika config.
You can also manually tweak the registry of the AutoDetectParser to suit your needs.
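As a sketch of tweaking the registry by hand, the helper below uses getParsers() and setParsers(), which AutoDetectParser inherits from CompositeParser. OcrRegistry and routeOcrTypes are hypothetical names, and it assumes a Tika 1.x/2.x tika-core on the classpath.

```java
import java.util.Map;

import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;

/** Illustrative helper (not part of Tika) that points OCR image types at a chosen parser. */
public final class OcrRegistry {
    private OcrRegistry() {}

    /**
     * Maps each given image subtype (e.g. "ocr-png") to ocrParser in the
     * AutoDetectParser's registry, replacing whatever was registered before.
     */
    public static void routeOcrTypes(AutoDetectParser autoDetect, Parser ocrParser,
                                     String... ocrSubtypes) {
        // getParsers() returns a copy of the type-to-parser map, so it is safe to modify.
        Map<MediaType, Parser> registry = autoDetect.getParsers();
        for (String subtype : ocrSubtypes) {
            registry.put(MediaType.image(subtype), ocrParser);
        }
        // Write the modified map back so it takes effect.
        autoDetect.setParsers(registry);
    }
}
```

For instance, OcrRegistry.routeOcrTypes(parser, myOcrParser, "ocr-png", "ocr-jpeg") would send those two types to myOcrParser regardless of what else is installed.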