
Getting hOCR output from tika-server


I am running OCR on a PDF file using Apache Tika Server.

I am interested in the hOCR output, but so far I have only managed to get the output as plain text.

Following the wiki and the code, I am trying to configure Tesseract through the X-Tika-OCR... HTTP headers. In this case I send the X-Tika-OCRoutputType: hocr HTTP header, but I get either plain text or HTML without any hOCR tags.

I tried both the /tika and /rmeta endpoints.

The curl commands I use:

curl -v -X PUT --data-binary @file.pdf \
     "http://tika-server:8081/tika" \
     -H "Content-Type: application/pdf" \
     -H "X-Tika-OCRoutputType: hocr"

curl -v -X PUT --data-binary @file.pdf \
     "http://tika-server:8081/rmeta" \
     -H "Content-Type: application/pdf" \
     -H "X-Tika-OCRoutputType: hocr"

I also tried setting the Accept header to text/plain, text/html, text/xhtml and text/hocr. None of them works, and the last one returns an error.

I am using:


Solution

  • By inspecting the integration tests in TikaResourceTest, I realized that an HTTP header was missing. In addition to the output type header, the request must include the X-Tika-PDFOcrStrategy: ocr_only HTTP header. See the OCR and PDF parser docs for more details.

    The command would thus be:

    curl -v -X PUT \
         --data-binary @file.pdf \
         -H "Content-Type: application/pdf" \
         -H "X-Tika-PDFOcrStrategy: ocr_only" \
         -H "X-Tika-OCROutputType: hocr" \
         "http://tika-server:8081/tika"
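
To confirm that the response really contains hOCR markup and not plain HTML, a rough check like the one below may help. It is only a sketch: the function names fetch_hocr and looks_like_hocr are made up for this example, and it assumes the server passes the standard hOCR class names (ocr_page, ocr_line, ocrx_word) through to its output, which may vary between Tika versions.

```shell
#!/bin/sh

# Hypothetical helper: send a PDF to tika-server with the two headers
# from the solution above and save the response body to a file.
#   $1 = input PDF, $2 = server base URL, $3 = output file
fetch_hocr() {
    curl -sS -X PUT --data-binary "@$1" \
         -H "Content-Type: application/pdf" \
         -H "X-Tika-PDFOcrStrategy: ocr_only" \
         -H "X-Tika-OCROutputType: hocr" \
         "$2/tika" -o "$3"
}

# Hypothetical helper: succeed if the file contains a class attribute
# that mentions one of the standard hOCR class names.
# Assumption: the server keeps these class names in its output.
looks_like_hocr() {
    grep -Eq "class=[\"'][^\"']*(ocr_page|ocr_line|ocrx_word)" "$1"
}
```

With a running server, usage could look like this: fetch_hocr file.pdf http://tika-server:8081 out.html && looks_like_hocr out.html && echo "hOCR markup present". If the check fails, the response is most likely still plain text or HTML without OCR annotations.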