I am extracting text with Tika server v1.20.
Tika adds [bookmark: xx] and [image: xx] in the text. I don't want them.
Sample output:
How the Gifted Brain Learns David A. Sousa [image: How the Gifted Brain Learns] Welcome to our Third Annual GATE Family Book Study.
Reproduce:
run server -
java -jar tika-server-1.20.jar -p 5000
PUT http://localhost:5000/tika
Attach the file as binary and set the content-type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
Removing these tags with the regex \[(image:|bookmark:).*?\] is problematic, because the lazy match stops at the first ] in cases like this:
[image: **[1].jpg]
How can I use Tika server without producing these tags? If that is not possible, how can I remove them?
Whilst you can override this in Tika by adding a custom DocumentSelector for the EmbeddedDocumentUtil to use in the ParseContext, there is nothing like that in tika-config.xml at the moment, nor in its command-line parameters.
As an aside, there is a header setting for the Recursive Metadata endpoint coming up in Tika 1.25 which lets you specify the maximum embedded recursion (see the example below). However, as you want the content, this doesn't help in your case:
curl -T test_recursive_embedded.docx --header "maxEmbeddedResources: 0" http://localhost:9998/rmeta
Depending on what part of the content you are looking to process, the /tika/main endpoint may be what you are looking for:
curl -T website\ book\ study\ how\ the\ brain\ works\ building\ background1.docx http://localhost:9998/tika/main --header "Accept: text/plain"
This aims to replicate the Tika App's --text-main function and uses the Boilerplate content handler, which focuses on the main content in a file. It therefore doesn't process the embedded images.
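If you still need the full /tika output and want to strip the tags afterwards, one option is to match brackets by depth instead of using a lazy regex. The following is only a minimal Python sketch, not something Tika provides: the function name strip_tika_tags is hypothetical, and it assumes the tags always start with "[image:" or "[bookmark:", that any brackets inside a tag are balanced (as in [image: **[1].jpg]), and that the document itself never contains literal text of that form.

```python
import re

# Opening of a tag, as seen in the sample output from the question.
TAG_START = re.compile(r"\[(?:image|bookmark):")


def strip_tika_tags(text: str) -> str:
    """Remove [image: ...] and [bookmark: ...] tags, allowing nested brackets."""
    out = []
    pos = 0
    while True:
        match = TAG_START.search(text, pos)
        if match is None:
            # No more tags: keep the rest of the text unchanged.
            out.append(text[pos:])
            break

        # Walk forward from the opening bracket, tracking bracket depth,
        # until the bracket that opened the tag is closed again.
        depth = 0
        end = None
        for i in range(match.start(), len(text)):
            if text[i] == "[":
                depth += 1
            elif text[i] == "]":
                depth -= 1
                if depth == 0:
                    end = i + 1
                    break

        if end is None:
            # Unbalanced brackets: leave the remainder untouched rather than guess.
            out.append(text[pos:])
            break

        # Keep the text before the tag and continue after its closing bracket.
        out.append(text[pos:match.start()])
        pos = end
    return "".join(out)
```

Calling strip_tika_tags on the text returned by the server removes each tag as a whole, including the [image: **[1].jpg] case. The whitespace around a removed tag is left as it was, so you may want to collapse double spaces afterwards.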