
How do you enable the TesseractOCRParser using TikaConfig and the Tika command line utility?

I have installed Apache Tika 1.8 and it is running perfectly, except that the OCR part is not working. I have Tesseract installed, and it is also working properly. When I try to send a PDF with an image in it, I get the following:

WARNING: Tesseract OCR is installed and will be automatically applied to image files unless you've excluded the TesseractOCRParser from the default parser. Tesseract may dramatically slow down content extraction (TIKA-2359). As of Tika 1.15 (and prior versions), Tesseract is automatically called. In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.

Can I configure the TikaConfig using the command line utility? Or do I have to clone the project, update the POMs, and rebuild? I really do not want to have to do that.

There is some info here on how to use the command line utility with the TikaConfig, but I cannot figure out how to enable the TesseractOCRParser with it.

Any help is greatly appreciated.


  • OK, so with the help of this post on the Apache Tika forum (thank you, guys), I managed to get it working. It's a hack, but it works.

    What I did was extract the tika-app JAR file, then locate PDFParser.properties and change the following properties like this:

    extractInlineImages true 
    extractUniqueInlineImagesOnly false 
    ocrStrategy ocr_and_text_extraction

    Then locate TesseractOCRConfig.properties and change this one property to 1.


    Save both properties files and zip everything up again. Use your new zipped-up JAR file, and it will now extract both the regular text and the text inside images from a PDF file.
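
If you would rather not unzip and re-zip by hand, the same hack can be scripted. The following is only a rough sketch, not something from the Tika project: `patch_properties` and `patch_jar` are hypothetical helper names, the JAR file names in the usage comment are placeholders, and it assumes the JAR is unsigned and that the properties file uses plain `key value` (or `key=value`) lines as shown above. It only covers the three PDFParser.properties values listed in this answer.

```python
import re
import zipfile

# Values copied from the answer above (PDFParser.properties).
PDF_OVERRIDES = {
    "extractInlineImages": "true",
    "extractUniqueInlineImagesOnly": "false",
    "ocrStrategy": "ocr_and_text_extraction",
}

# A properties line is "key<sep>value", where <sep> is whitespace, '=' or ':'.
# Comment lines (starting with '#' or '!') do not match.
_LINE = re.compile(r"^(\s*)([^\s=:#!]+)(\s*[=:]\s*|\s+)(.*)$")


def patch_properties(text, overrides):
    """Return text with the values of the given keys replaced.

    Keys that do not already appear in the text are not added.
    """
    out = []
    for line in text.splitlines():
        m = _LINE.match(line)
        if m and m.group(2) in overrides:
            # Keep the original indentation and separator, swap the value.
            line = f"{m.group(1)}{m.group(2)}{m.group(3)}{overrides[m.group(2)]}"
        out.append(line)
    return "\n".join(out) + "\n"


def patch_jar(src, dst, entry_name, overrides):
    """Copy the JAR at src to dst, rewriting entries whose path ends with entry_name.

    Returns the number of entries that were rewritten.
    """
    patched = 0
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename.endswith(entry_name):
                # .properties files are ISO-8859-1 by convention.
                text = data.decode("latin-1")
                data = patch_properties(text, overrides).encode("latin-1")
                patched += 1
            zout.writestr(info, data)
    return patched


# Example usage (placeholder file names):
# patch_jar("tika-app.jar", "tika-app-ocr.jar", "PDFParser.properties", PDF_OVERRIDES)
```

Writing to a new JAR instead of modifying the original in place means you can always go back to the stock tika-app JAR if the patched one misbehaves.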