openshift tesseract apache-tika rhel7 tika-server

Tika Server not reading embedded images in PDFs


Hi, Tika Server is set up with Tesseract, but it is still not reading embedded images in PDFs. I tried the two available headers, but they did not help.

This happens for PDF files only; OCR works for other file types and images.

I am using a customized Docker container. Oddly, the same container works when deployed on another machine. Could this be a lower-level issue?

Update: After comparing logs, it seems that OCP is lowercasing custom HTTP headers such as X-Tika... and Postman-Token to x-tika... and postman-token. Can anyone help me work out what the issue could be?


Solution

  • It seems that OCP lowercasing the custom headers is the cause of the issue. Tika Server 1.25 does not match X-Tika headers case-insensitively, so the lowercased headers are ignored.

    I have fixed this in Tika Server 1.26. Ref: https://tika.apache.org/1.26/index.html https://issues.apache.org/jira/browse/TIKA-3320
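
To illustrate why lowercased headers can be silently dropped, here is a minimal Python sketch. It is not Tika's actual code (Tika Server is written in Java); the function names and the prefix-matching logic are hypothetical and only model the general failure mode. The header name X-Tika-PDFextractInlineImages is assumed here as an example of a header a client might send. HTTP header names are case-insensitive by specification, so a proxy is allowed to lowercase them, and the receiving side should compare them without regard to case.

```python
# Hypothetical model of header-prefix matching; NOT Tika's real implementation.
PREFIX = "X-Tika-PDF"


def pdf_options_case_sensitive(headers):
    """Collect options only when the header name starts with the exact-case prefix."""
    return {
        name[len(PREFIX):]: value
        for name, value in headers.items()
        if name.startswith(PREFIX)
    }


def pdf_options_case_insensitive(headers):
    """Collect options regardless of header-name case (keys are lowercased)."""
    return {
        name[len(PREFIX):].lower(): value
        for name, value in headers.items()
        if name.lower().startswith(PREFIX.lower())
    }


# Header as sent by the client (assumed example name).
sent = {"X-Tika-PDFextractInlineImages": "true"}

# The same header after a proxy or router lowercases the name.
received = {name.lower(): value for name, value in sent.items()}

print(pdf_options_case_sensitive(sent))
print(pdf_options_case_sensitive(received))
print(pdf_options_case_insensitive(received))
```

In this sketch the case-sensitive matcher finds the option in the original request but returns nothing for the lowercased one, which matches the symptom of the same container working on one machine and not behind OCP. The case-insensitive matcher recovers the option in both cases.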