apache-tika tika-server

Apache Tika Server - Request Header Parameters?


The Apache Tika Server provides a REST API to extract text from a document. It is also possible to set specific request header parameters such as X-Tika-PDFOcrStrategy, for example:

$ curl -T test/Dokument01.pdf http://localhost:9998/tika --header "X-Tika-PDFOcrStrategy: ocr_only"

From various documents about Tika, I collected these additional header parameters:

X-Tika-OCRLanguage: eng
X-Tika-PDFextractInlineImages: true | false
X-Tika-PDFOcrStrategy: ocr_only  |  ocr_and_text_extraction
X-Tika-OCRoutputType: hocr

But there seems to be no documentation on how to use the X-Tika-... header parameters, or on which parameters are supported and which are not.

For example, I wonder whether it is possible to override the image type or the DPI with something like:

X-Tika-PDFocrImageType: rgb
X-Tika-PDFocrDPI: 100

My question is: which header parameters are supported, and which naming convention do these parameters follow?


Solution

  • The code that handles the X-Tika-OCR and X-Tika-PDF headers is TikaResource.processHeaderConfig.

    Those header suffixes and values are then mapped onto the TesseractOCRConfig and PDFParserConfig configuration objects via reflection.

    So, to see which X-Tika headers you can set, look up the options on the config class you want to adjust (TesseractOCRConfig or PDFParserConfig), build the header name from the matching prefix (X-Tika-OCR or X-Tika-PDF) followed by the option name, then set that header. If you are not sure what an option does, or what values it takes, look at the JavaDocs for the underlying setter method that will get called.

    For example, setExtractInlineImages on PDFParserConfig maps to the header X-Tika-PDFextractInlineImages.
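
To make the naming convention more concrete, here is a minimal Java sketch of how a header name could be mapped onto a setter via reflection. It is not Tika's actual code: DemoPdfConfig, setterNameFor and applyHeader are hypothetical names invented for this illustration. The sketch assumes that the first letter after the prefix is capitalized before "set" is prepended, which is consistent with the headers listed in the question (both PDFextractInlineImages and PDFOcrStrategy).

```java
import java.lang.reflect.Method;

class HeaderMappingSketch {

    /** Hypothetical stand-in for a config class; NOT the real PDFParserConfig. */
    static final class DemoPdfConfig {
        private boolean extractInlineImages;
        private String ocrStrategy = "no_ocr";

        public void setExtractInlineImages(boolean value) { this.extractInlineImages = value; }
        public void setOcrStrategy(String value) { this.ocrStrategy = value; }
        public boolean getExtractInlineImages() { return extractInlineImages; }
        public String getOcrStrategy() { return ocrStrategy; }
    }

    /** Assumed convention: strip the prefix, capitalize the first letter, prepend "set". */
    static String setterNameFor(String headerName, String prefix) {
        if (!headerName.startsWith(prefix) || headerName.length() == prefix.length()) {
            throw new IllegalArgumentException("Not a " + prefix + " header: " + headerName);
        }
        String property = headerName.substring(prefix.length());
        return "set" + Character.toUpperCase(property.charAt(0)) + property.substring(1);
    }

    /** Finds a public one-argument setter by name and calls it with the converted value. */
    static void applyHeader(Object config, String headerName, String value, String prefix) {
        String setterName = setterNameFor(headerName, prefix);
        for (Method m : config.getClass().getMethods()) {
            if (!m.getName().equals(setterName) || m.getParameterCount() != 1) {
                continue;
            }
            // Convert the header string to the setter's parameter type.
            Class<?> type = m.getParameterTypes()[0];
            Object arg;
            if (type == String.class) {
                arg = value;
            } else if (type == boolean.class) {
                arg = Boolean.parseBoolean(value);
            } else if (type == int.class) {
                arg = Integer.parseInt(value);
            } else {
                continue; // parameter type not handled by this sketch
            }
            try {
                m.invoke(config, arg);
                return;
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Could not call " + setterName, e);
            }
        }
        throw new IllegalArgumentException("No supported setter named " + setterName);
    }

    public static void main(String[] args) {
        DemoPdfConfig config = new DemoPdfConfig();
        applyHeader(config, "X-Tika-PDFextractInlineImages", "true", "X-Tika-PDF");
        applyHeader(config, "X-Tika-PDFOcrStrategy", "ocr_only", "X-Tika-PDF");
        // prints: true ocr_only
        System.out.println(config.getExtractInlineImages() + " " + config.getOcrStrategy());
    }
}
```

Under this assumed convention, a header only has an effect if the config class has a setter with the matching name and a parameter type the value can be converted to. That is why the JavaDocs of the real TesseractOCRConfig and PDFParserConfig classes are the place to check before sending a new header.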