I run an unmodified JAX-RS instance of the Apache tika-server 1.22 and use it as an HTTP end-point service that I post files to (mostly Office, PDF and RTF) and get plain-text renditions back with HTTP requests (using the Accept="text/plain" header) from our application.
Since Tika 1.15, the default behaviour has been to "extract all embedded documents" (TIKA-2096).
I want to be able to turn this behaviour off on our tika-server so that embedded documents are NOT extracted and I only get the text rendition of the main document contents.
Is it possible to do this via a tika-config.xml file, or do I need to do a custom build and subclass EmbeddedDocumentExtractor so that it doesn't do anything?
An answer to tika-parser-exclude-pdf-attachments indicates that you can turn this behaviour off by subclassing EmbeddedDocumentExtractor, but I'd like to check if it's possible to do this via tika-config.xml without having to do a custom build of the tika-server.
I have looked at Configuring Tika, but there is no mention of embedded docs there.
The answers in tika-parser-exclude-pdf-attachments are excellent if you are calling Tika from code.
Until now there has been no way to do this for embedded files in Tika Server, other than disabling the whole file type using EmptyParser with something like the below:
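<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep the default parsers for everything except the excluded types -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/zip</mime-exclude>
    </parser>
    <!-- Route the excluded types to EmptyParser, which extracts nothing -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>image/jpeg</mime>
      <mime>application/zip</mime>
    </parser>
  </parsers>
</properties>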
This has become a common request, so I've added a feature, coming in Tika 1.25 (yet to be released), that lets you skip embedded files using a header setting:
curl -T test_recursive_embedded.docx http://localhost:9998/tika --header "Accept: text/html" --header "X-Tika-Skip-Embedded: true"
Any parser using the EmbeddedDocumentExtractor will honour this.
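If your application talks to the server from code rather than curl, the same request can be sketched roughly as below. This is only a minimal illustration using Python's standard library; the function names build_tika_request and extract_text are made up for this example, the URL assumes a tika-server listening locally on port 9998, and the X-Tika-Skip-Embedded header only has an effect on a server version that includes the feature described above.

```python
import urllib.request

# Assumption: tika-server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, skip_embedded: bool = True,
                       url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request (like `curl -T`) asking for a plain-text rendition."""
    headers = {
        "Accept": "text/plain",
        # Set explicitly so urllib does not default to a form-encoded type.
        "Content-Type": "application/octet-stream",
    }
    if skip_embedded:
        # Header from the answer above; ignored by servers without the feature.
        headers["X-Tika-Skip-Embedded"] = "true"
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")


def extract_text(path: str, skip_embedded: bool = True) -> str:
    """Send the file at `path` to tika-server and return the extracted text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), skip_embedded)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling extract_text("test_recursive_embedded.docx") should then return the text of the main document only, while passing skip_embedded=False leaves the header off and keeps the default behaviour.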