I had a setup running where I could extract documents in Solr (8.11.2 with Tika 1.27) and get OCR from Tesseract (5.2.0).
To do this I had updated TesseractOCRConfig.properties inside tika-parsers-1.27.jar with:
tesseractPath=C:/Tesseract-OCR
tessdataPath=C:/Tesseract-OCR/tessdata/
language=dan
I am now trying to replicate the setup with Solr 9.1 (Tika 1.28.4) and the same Tesseract installation. The files are getting extracted, but I am not getting any OCR.
In 9.1.0 I am getting the following when extracting a jpg file:
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.jpeg.JpegParser"],
In a setup with 8.11.2 I am getting the following when extracting the same jpg:
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.ocr.TesseractOCRParser",
"org.apache.tika.parser.jpeg.JpegParser"],
Turn off the security manager that is on by default in 9.x. This can be done by setting the environment variable:
SOLR_SECURITY_MANAGER_ENABLED=false
The issue is that org.apache.tika.parser.ocr.TesseractOCRParser requires execution rights on the folder where Tesseract is installed.
When determining whether TesseractOCRParser should be loaded, Tika checks if it can locate and call Tesseract based on the configuration. The check method used to see if it can execute an external parser catches SecurityException, among other exceptions, and just returns false without any logging, so there is no sign that something is configured wrong even if you turn up logging.
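Since the built-in check hides the reason for the failure, a small standalone probe can make it visible. The following is a minimal sketch, not Tika's actual code: the class name TesseractCheck and the helper probe are hypothetical, and the default path assumes the executable is C:/Tesseract-OCR/tesseract.exe (the tesseractPath from the question plus an assumed file name). It tries to start the command and reports which kind of exception occurred instead of collapsing everything into false.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class TesseractCheck {

    // Hypothetical helper: tries to run the command and reports why it failed,
    // rather than turning every failure into a silent "false".
    static String probe(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            // Give the process a bounded amount of time to finish.
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return "TIMEOUT";
            }
            return "EXIT " + process.exitValue();
        } catch (SecurityException e) {
            // A security manager refused to let this JVM start the process.
            return "DENIED: " + e.getMessage();
        } catch (IOException e) {
            // The executable is missing or cannot be started.
            return "NOT_RUNNABLE: " + e.getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "INTERRUPTED";
        }
    }

    public static void main(String[] args) {
        // Assumed default location; pass the real path as the first argument.
        String exe = args.length > 0 ? args[0] : "C:/Tesseract-OCR/tesseract.exe";
        System.out.println(probe(exe, "--version"));
    }
}
```

Run on its own, this only confirms that the binary can be found and started. It would report DENIED only if it runs under the same security policy as Solr, so treat it as a way to separate "wrong path" from "blocked by the security manager", not as a reproduction of Solr's behaviour.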