apache-tika tika-server

Tika Server - Parse without bookmark and image tags


I am extracting text with Tika Server v1.20.

Tika adds [bookmark: xx] and [image: xx] tags to the extracted text. I don't want them.

Sample output:

How the Gifted Brain Learns David A. Sousa [image: How the Gifted Brain Learns] Welcome to our Third Annual GATE Family Book Study.

Reproduce:

Run the server:

java -jar tika-server-1.20.jar -p 5000

PUT http://localhost:5000/tika

Attach the file as the binary request body and set Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document

Input file: http://www.hasd.org/cms_files/resources/website%20book%20study%20how%20the%20brain%20works%20building%20background1.docx

Removing these tags with the regex \[(image:|bookmark:).*?\] is problematic because the lazy match stops at the first ], which breaks on cases like this:

[image: **[1].jpg]

How can I use Tika Server without producing these tags? If that is not possible, how can I remove them?


Solution

  • Whilst you can override this in Tika by adding a custom DocumentSelector for the EmbeddedDocumentUtil to use in the ParseContext, there is nothing like that in tika-config.xml at the moment, nor in its command-line parameters.

    As an aside, there is a header setting for the Recursive Metadata endpoint coming up in Tika 1.25 which lets you specify the maximum embedded recursion (see the example below). However, as you want the content, this doesn't help in your case:

    curl -T test_recursive_embedded.docx --header "maxEmbeddedResources: 0" http://localhost:9998/rmeta
    

    Depending on which part of the content you want to process, the /tika/main endpoint may be what you are looking for:

    curl -T website\ book\ study\ how\ the\ brain\ works\ building\ background1.docx http://localhost:9998/tika/main --header "Accept: text/plain"
    

    This aims to replicate the Tika App's --text-main function and uses the Boilerpipe content handler (BoilerpipeContentHandler), which focuses on the main content in a file. It therefore doesn't process the embedded images.
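
If you still need the full /tika output, a fallback is to strip the tags after extraction. As the question notes, a lazy regex stops at the first ], so one option is to count bracket depth instead. The following is a minimal Python sketch, not something Tika provides: the function name strip_tika_tags is hypothetical, and it assumes that any brackets inside a tag are balanced, as in [image: **[1].jpg]. A tag with no matching closing bracket is left untouched.

```python
import re

# Opening markers seen in the Tika Server plain-text output quoted in the question.
_TAG_START = re.compile(r"\[(?:image|bookmark):")


def strip_tika_tags(text: str) -> str:
    """Remove [image: ...] and [bookmark: ...] tags, honouring nested brackets."""
    out = []
    pos = 0
    while True:
        match = _TAG_START.search(text, pos)
        if match is None:
            out.append(text[pos:])
            break
        out.append(text[pos:match.start()])
        # Walk forward from the opening bracket, tracking bracket depth.
        depth = 0
        end = None
        for i in range(match.start(), len(text)):
            ch = text[i]
            if ch == "[":
                depth += 1
            elif ch == "]":
                depth -= 1
                if depth == 0:
                    end = i + 1
                    break
        if end is None:
            # No matching close bracket: keep the rest of the text as it is.
            out.append(text[match.start():])
            break
        pos = end
    # Collapse runs of spaces, including the gap left where a tag was removed.
    return re.sub(r" {2,}", " ", "".join(out))


if __name__ == "__main__":
    sample = "David A. Sousa [image: How the Gifted Brain Learns] Welcome"
    print(strip_tika_tags(sample))
    print(strip_tika_tags("before [image: **[1].jpg] after"))
```

Note that the final substitution also collapses any other runs of spaces in the text; drop that line if the original spacing matters. Ordinary bracketed text that does not start with image: or bookmark: is not touched.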