[SOLVED] getting hocr output from tika-server

I am running OCR on a PDF file using Apache Tika Server.

I am interested in the hOCR output, but I only manage to get plain text.

Following the wiki and the code, I am trying to configure Tesseract through the X-Tika-OCR... HTTP headers. In this case I send the X-Tika-OCRoutputType: hocr header, but I get either plain text or HTML without any hOCR tags.

I tried both the /tika and /rmeta endpoints.

The curl commands I use:

curl -v -X PUT --data-binary @file.pdf \
     "http://tika-server:8081/tika" \
     -H "Content-Type: application/pdf" \
     -H "X-Tika-OCRoutputType: hocr"

curl -v -X PUT --data-binary @file.pdf \
     "http://tika-server:8081/rmeta" \
     -H "Content-Type: application/pdf" \
     -H "X-Tika-OCRoutputType: hocr"

I also tried setting the Accept header to text/plain, text/html, text/xhtml and text/hocr. None of them works, and the last one returns an error.

I am using:

Apache Tika 1.22
Tesseract 4.1.0-3.1.x86_64
RedHat 7

Solution

By inspecting the integration tests in TikaResourceTest, I realized that an HTTP header was missing. The request must also include the X-Tika-PDFOcrStrategy: ocr_only HTTP header. See the OCR and PDF parser docs for more details.

The working command is:

curl -v -X PUT \
     --data-binary @file.pdf \
     -H "Content-Type: application/pdf" \
     -H "X-Tika-PDFOcrStrategy: ocr_only" \
     -H "X-Tika-OCROutputType: hocr" \
     "http://tika-server:8081/tika"
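
To confirm that the response really carries hOCR markup rather than plain XHTML, the output can be piped through a small check. This is only a sketch: looks_like_hocr is a hypothetical helper name, and it assumes the response keeps the standard hOCR class names (ocr_page, ocr_line, ocrx_word) in its class attributes.

```shell
#!/bin/sh
# looks_like_hocr (hypothetical helper): reads a response on stdin and
# exits 0 if it contains at least one standard hOCR class name.
# Assumption: the class names appear as class="..." or class='...'.
looks_like_hocr() {
    grep -q -E "class=[\"'](ocr_page|ocr_line|ocrx_word)[\"']"
}

# Assumed usage against the server from the question (not executed here):
#   curl -s -X PUT --data-binary @file.pdf \
#        -H "Content-Type: application/pdf" \
#        -H "X-Tika-PDFOcrStrategy: ocr_only" \
#        -H "X-Tika-OCROutputType: hocr" \
#        "http://tika-server:8081/tika" \
#     | looks_like_hocr && echo "hOCR markup found"
```

If the check fails, the response is probably still plain XHTML, which is what I was getting before adding the X-Tika-PDFOcrStrategy header.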