I am running OCR on a PDF file using Apache Tika Server.
I am interested in the hOCR output, but so far I have only managed to get plain text output.
Following the wiki and the code, I am trying to configure Tesseract using the X-Tika-OCR... HTTP headers. In this case, I am using the X-Tika-OCRoutputType: hocr HTTP header, but I get either plain text output or HTML output without hOCR tags.
I tried both the /tika and /rmeta endpoints.
The curl commands I use:
curl -v -X PUT --data-binary @file.pdf \
"http://tika-server:8081/tika" \
-H "Content-Type: application/pdf" \
-H "X-Tika-OCRoutputType: hocr"

curl -v -X PUT --data-binary @file.pdf \
"http://tika-server:8081/rmeta" \
-H "Content-Type: application/pdf" \
-H "X-Tika-OCRoutputType: hocr"
I also tried setting the Accept header to text/plain, text/html, text/xhtml and text/hocr. None of them works, and the last one returns an error.
I am using:
By inspecting the integration test code in TikaResourceTest, I realized that an HTTP header was missing. The command should also include the X-Tika-PDFOcrStrategy: ocr_only HTTP header, which tells the PDF parser to render each page and send it to OCR instead of extracting the embedded text. See more in the OCR and PDF parser docs.
The command would thus be:
curl -v -X PUT \
--data-binary @file.pdf \
-H "Content-Type: application/pdf" \
-H "X-Tika-PDFOcrStrategy: ocr_only" \
-H "X-Tika-OCROutputType: hocr" \
"http://tika-server:8081/tika"
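To confirm that the response really is hOCR and not plain XHTML, one option is to save it and look for hOCR class names. The following is only a minimal sketch: the function names fetch_hocr and has_hocr_markup and the TIKA_URL variable are made up for illustration and are not part of Tika, and it assumes the class names written by Tesseract (ocr_page, ocrx_word) are passed through unchanged in the server response.

```shell
#!/bin/sh
# Illustrative helpers; the names here are hypothetical, not part of Tika.

# Endpoint from the command above; override by exporting TIKA_URL.
TIKA_URL="${TIKA_URL:-http://tika-server:8081/tika}"

# Send a PDF with the two headers from the answer and print the
# response body on stdout.
fetch_hocr() {
    curl -sS -X PUT \
        --data-binary "@$1" \
        -H "Content-Type: application/pdf" \
        -H "X-Tika-PDFOcrStrategy: ocr_only" \
        -H "X-Tika-OCROutputType: hocr" \
        "$TIKA_URL"
}

# Succeed if the given file mentions an hOCR class name.
# ocr_page comes from the hOCR spec; ocrx_word is what Tesseract
# uses for words. Assumption: Tika keeps these class names as-is.
has_hocr_markup() {
    grep -Eq 'ocr_page|ocrx_word' "$1"
}
```

A possible way to use it: fetch_hocr file.pdf > out.html && has_hocr_markup out.html && echo "hOCR markup found". If the check fails, the output is most likely the plain XHTML described in the question.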