
How to use a custom OCR implementation with Tika?


I know Tika supports Tesseract OCR. I want to use my own custom OCR library with Tika instead of Tesseract. How can I achieve this?

I also want the custom library to be used for PDF parsing, following Tika's auto OCR strategy.

I came across this wiki, which describes how to use Tesseract with Tika, but I found no instructions on how to replace it with a custom OCR library of our choice.

(I am currently using Tika as a jar and not as Tika server, if that is relevant.)


Solution

  • The solution is to create a custom OCR parser that takes the place of TesseractOCRParser, since OCR parsers are selected by the same mechanism used to select any other parser.

    All I had to do was make its getSupportedTypes() method return the same OCR types that TesseractOCRParser supports. Each of these is the actual MIME subtype with "ocr-" prepended, e.g. "image/ocr-png".

    These are not standard MIME types but Tika-specific ones. While parsing an image, Tika checks dynamically whether a parser supporting the corresponding OCR type is available, and keeps the standard non-OCR parsers as a fallback. This means you don't need to (and shouldn't) declare standard MIME types as the supported types of an OCR parser; declare the "ocr-" types instead.

    Then override the parse method and put the custom OCR logic there.

    Creating a new AutoDetectParser with this custom parser in its registry does the job.

    If TesseractOCRParser competes with it, either uninstall Tesseract from the host, in which case that parser returns an empty set from getSupportedTypes(), or exclude the parser in the config.

    You can also tweak the AutoDetectParser's registry manually to suit your needs.
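
To make the "ocr-" naming and the fallback described above concrete, here is a minimal toy model in plain Java. It is not Tika code and uses no Tika classes: the class OcrLookupSketch, its registry map, and the parser names are all hypothetical, and the lookup only mirrors the behaviour as described in this answer (try the parser registered for the "ocr-" variant first, otherwise use the standard one).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model of the lookup described in the answer; this is NOT Tika code. */
class OcrLookupSketch {

    // Hypothetical registry: MIME type -> name of the parser that claims it.
    private final Map<String, String> registry = new LinkedHashMap<>();

    /** Registers a parser name for a type; a later registration replaces an earlier one. */
    void register(String mimeType, String parserName) {
        registry.put(mimeType, parserName);
    }

    /** Turns "image/png" into "image/ocr-png" by prefixing the subtype with "ocr-". */
    static String toOcrType(String mimeType) {
        int slash = mimeType.indexOf('/');
        if (slash < 0) {
            throw new IllegalArgumentException("Not a MIME type: " + mimeType);
        }
        return mimeType.substring(0, slash + 1) + "ocr-" + mimeType.substring(slash + 1);
    }

    /** Prefers the parser registered for the OCR variant, else the standard parser. */
    Optional<String> resolve(String mimeType) {
        String ocrParser = registry.get(toOcrType(mimeType));
        if (ocrParser != null) {
            return Optional.of(ocrParser);
        }
        return Optional.ofNullable(registry.get(mimeType));
    }

    public static void main(String[] args) {
        OcrLookupSketch lookup = new OcrLookupSketch();
        lookup.register("image/png", "StandardImageParser");
        lookup.register("image/jpeg", "StandardImageParser");
        // Only the OCR variant of PNG is claimed by the (hypothetical) custom parser.
        lookup.register("image/ocr-png", "MyCustomOcrParser");

        System.out.println(lookup.resolve("image/png").orElse("none"));  // MyCustomOcrParser
        System.out.println(lookup.resolve("image/jpeg").orElse("none")); // StandardImageParser
    }
}
```

In this model the custom parser never claims "image/png" itself, so the standard parser stays registered and is still used whenever no OCR parser is present for a type. A real implementation would return the corresponding media types from getSupportedTypes() of the custom parser instead of strings in a map.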