
How do you enable the TesseractOCRParser using TikaConfig and the Tika command line utility?


I have installed Apache Tika 1.8 and it runs perfectly, except that the OCR part is not working. I have Tesseract installed and it also works properly on its own. When I try to send a PDF with an image in it, I get the following warning:

WARNING: Tesseract OCR is installed and will be automatically applied to image files unless you've excluded the TesseractOCRParser from the default parser. Tesseract may dramatically slow down content extraction (TIKA-2359). As of Tika 1.15 (and prior versions), Tesseract is automatically called. In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.

Can I configure the TikaConfig using the command line utility? Or do I have to clone the project, update the POMs, and rebuild? I really do not want to have to do that.

There is some info here on how to use the command line utility with TikaConfig, but I cannot figure out how to enable the TesseractOCRParser with it.

Any help is greatly appreciated.


Solution

  • OK, so with the help of this post on the Apache Tika forum (thank you, guys), I managed to get it working.

    It's a hack, but it works. I extracted the tika-app jar file, located PDFParser.properties, and changed the following properties like this:

    extractInlineImages true 
    extractUniqueInlineImagesOnly false 
    ocrStrategy ocr_and_text_extraction
    

    Then locate TesseractOCRConfig.properties and change this one property to 1:

    enableImageProcessing=1
    

    Save both properties files and zip everything up again. Use the repacked jar file in place of the original, and it will extract both the regular text and the text inside images from a PDF file.
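
The extract, edit, and re-zip steps above can also be scripted. The following is only a rough Python sketch of that idea, not an official Tika tool: the function names `apply_overrides` and `patch_jar` are made up for illustration, the two properties files are matched by file name wherever they sit inside the jar, and it assumes the files use the same separators shown above (a space in PDFParser.properties, `=` in TesseractOCRConfig.properties).

```python
import re
import zipfile

# Values taken from the answer above; the dictionary names are hypothetical.
PDF_OVERRIDES = {
    "extractInlineImages": "true",
    "extractUniqueInlineImagesOnly": "false",
    "ocrStrategy": "ocr_and_text_extraction",
}
OCR_OVERRIDES = {"enableImageProcessing": "1"}


def apply_overrides(text, overrides, sep):
    """Rewrite matching keys in properties text; append keys that are missing."""
    remaining = dict(overrides)
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines, which start with '#' or '!'.
        if stripped and not stripped.startswith(("#", "!")):
            key = re.split(r"[\s=:]+", stripped, maxsplit=1)[0]
            if key in remaining:
                line = f"{key}{sep}{remaining.pop(key)}"
        out.append(line)
    out.extend(f"{k}{sep}{v}" for k, v in remaining.items())
    return "\n".join(out) + "\n"


def patch_jar(src_jar, dst_jar):
    """Copy src_jar to dst_jar, patching the two properties files on the way."""
    with zipfile.ZipFile(src_jar) as src, \
            zipfile.ZipFile(dst_jar, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            name = item.filename.rsplit("/", 1)[-1]
            # latin-1 round-trips every byte, so untouched lines stay intact.
            if name == "PDFParser.properties":
                text = apply_overrides(data.decode("latin-1"), PDF_OVERRIDES, " ")
                data = text.encode("latin-1")
            elif name == "TesseractOCRConfig.properties":
                text = apply_overrides(data.decode("latin-1"), OCR_OVERRIDES, "=")
                data = text.encode("latin-1")
            dst.writestr(item, data)
```

Calling something like `patch_jar("tika-app.jar", "tika-app-ocr.jar")` (both file names are placeholders for your own paths) writes a new jar and leaves the original untouched, so you can fall back to it if the patched one misbehaves.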