3.2.4
JDK 17
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>2.9.2</version>
</dependency>
<!-- Uses commons-compress lib with version 1.26.1 -->
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers-standard-package</artifactId>
<version>2.9.2</version>
</dependency>
<!-- Uses commons-compress lib with version 1.24.0 -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<scope>test</scope>
<version>4.13.2</version>
</dependency>
<!-- Uses commons-compress lib with version 1.24.0 -->
<dependency>
<groupId>org.testcontainers</groupId>
<artifactId>testcontainers</artifactId>
<scope>test</scope>
<version>1.19.7</version>
</dependency>
String htmlFile = "<!DOCTYPE html><html><head><!-- HTML Codes by Quackit.com --><title></title><meta name=\"viewport\" content=\"width=device-width, initial-scale=1\"><style>body {background-color:#ffffff;background-repeat:no-repeat;background-position:top left;background-attachment:fixed;}h1{font-family:Arial, sans-serif;color:#000000;background-color:#ffffff;}p {font-family:Georgia, serif;font-size:14px;font-style:normal;font-weight:normal;color:#000000;background-color:#ffffff;}</style></head><body><h1>test</h1><p>test2</p></body></html>";
Tika tika = new Tika();
// This throws the exception shown below
String text = tika.parseToString(TikaInputStream.get(htmlFile.getBytes()));
// This does not work either; it throws the same exception
text = tika.parseToString(new ByteArrayInputStream(htmlFile.getBytes()));
java.lang.NoSuchMethodError: 'org.apache.commons.compress.archivers.tar.TarArchiveEntry org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry()'
at org.apache.tika.detect.zip.TikaArchiveStreamFactory.detect(TikaArchiveStreamFactory.java:293) ~[tika-parser-zip-commons-2.9.2.jar:2.9.2]
at org.apache.tika.detect.zip.DefaultZipContainerDetector.detectArchiveFormat(DefaultZipContainerDetector.java:124) ~[tika-parser-zip-commons-2.9.2.jar:2.9.2]
at org.apache.tika.detect.zip.DefaultZipContainerDetector.detect(DefaultZipContainerDetector.java:175) ~[tika-parser-zip-commons-2.9.2.jar:2.9.2]
at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84) ~[tika-core-2.9.2.jar:2.9.2]
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:177) ~[tika-core-2.9.2.jar:2.9.2]
at org.apache.tika.Tika.parseToString(Tika.java:525) ~[tika-core-2.9.2.jar:2.9.2]
at org.apache.tika.Tika.parseToString(Tika.java:495) ~[tika-core-2.9.2.jar:2.9.2]
at org.apache.tika.Tika.parseToString(Tika.java:557) ~[tika-core-2.9.2.jar:2.9.2]
at ... -> tika.parseToString(...
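A note on reading this kind of error: the JVM links a call by method name, parameter types and return type, so a jar that only has getNextEntry() with a different return type does not satisfy the call. Below is a minimal sketch for checking at runtime which jar a class was actually loaded from and whether it has the expected method. The class name ClasspathProbe and its two helpers are made up for illustration; only standard JDK reflection is used, and main() assumes commons-compress is on the classpath.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

// Hypothetical helper, not part of Tika or commons-compress.
final class ClasspathProbe {

    /** Returns the jar or directory the class was loaded from, or a note for JDK classes. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return source == null ? "(JDK / bootstrap class loader)" : source.getLocation().toString();
    }

    /** True if the class exposes a public no-arg method with exactly this name and return type. */
    static boolean hasNoArgMethod(Class<?> type, String name, String returnTypeName) {
        for (Method m : type.getMethods()) {
            if (m.getName().equals(name)
                    && m.getParameterCount() == 0
                    && m.getReturnType().getName().equals(returnTypeName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Default to the class named in the stack trace; pass another class name to probe it instead.
        String className = args.length > 0 ? args[0]
                : "org.apache.commons.compress.archivers.tar.TarArchiveInputStream";
        Class<?> type = Class.forName(className);
        System.out.println("loaded from: " + locationOf(type));
        System.out.println("has getNextEntry() returning TarArchiveEntry: "
                + hasNoArgMethod(type, "getNextEntry",
                        "org.apache.commons.compress.archivers.tar.TarArchiveEntry"));
    }
}
```

Run from inside the application (for example from a test), this should print the path of the commons-compress jar that is really on the classpath, which makes a version mismatch visible without reading the whole dependency tree.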
I also tried a different way of invoking Tika, but it throws the same exception. While debugging I saw that the first approach calls this second one internally in the Tika libraries. This is the second approach I tried:
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
parser.parse(TikaInputStream.get(htmlFile.getBytes()), handler, metadata);
text = handler.toString();
With Tika version 2.9.1 everything works fine, as it did months ago. But I want the latest version, without known vulnerabilities. So I tried removing tika-parsers-standard-package from the pom. Debugging the Tika libraries showed that parsing then got further but returned only empty text, which is logical because the parser for HTML-like files was missing. Is this a bug in the Tika libraries, or am I doing something wrong?
I am posting an answer for others who run into the same issue. This is what helped me:
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>2.9.2</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers-standard-package</artifactId>
<version>2.9.2</version>
</dependency>
<!-- By adding this the problem is solved -->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-compress</artifactId>
<version>1.26.1</version>
</dependency>
I also investigated why this was happening with the following command:
mvn dependency:tree -Dverbose | grep 'omitted for conflict' | grep 'commons-compress'
Command response in my case:
[INFO] | | | +- (org.apache.commons:commons-compress:jar:1.25.0:compile - omitted for conflict with 1.26.1)
[INFO] | | | \- (org.apache.commons:commons-compress:jar:1.26.1:compile - omitted for conflict with 1.24.0)
[INFO] | | | +- (org.apache.commons:commons-compress:jar:1.25.0:compile - omitted for conflict with 1.24.0)
[INFO] | | \- (org.apache.commons:commons-compress:jar:1.26.1:compile - omitted for conflict with 1.24.0)
This means Tika ended up using commons-compress 1.24.0 because of a conflict in the transitive dependencies between tika-parsers-standard-package and the junit/testcontainers dependencies. After pinning the commons-compress library explicitly to version 1.26.1, all dependencies resolve to commons-compress 1.26.1, so Tika gets the version it expects and works as expected again.
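For larger projects the verbose tree output can be long, so here is a small sketch that pulls the interesting parts out of the "omitted for conflict" lines: which version was requested and which version Maven kept instead. The class ConflictLines and the record Conflict are hypothetical names, and the regular expression assumes the five-part group:artifact:type:version:scope format seen in the output above (coordinates with a classifier would not match).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading `mvn dependency:tree -Dverbose` output.
final class ConflictLines {

    /** One "omitted for conflict" entry: the artifact, the version asked for, and the version kept. */
    record Conflict(String artifact, String requested, String selected) {}

    // Matches e.g. "(org.apache.commons:commons-compress:jar:1.25.0:compile - omitted for conflict with 1.26.1)"
    private static final Pattern OMITTED = Pattern.compile(
            "\\(([^:\\s()]+:[^:\\s()]+):[^:\\s()]+:([^:\\s()]+):[^:\\s()]+"
                    + " - omitted for conflict with ([^)\\s]+)\\)");

    /** Extracts every conflict entry found in the given output lines; other lines are ignored. */
    static List<Conflict> parse(List<String> lines) {
        List<Conflict> conflicts = new ArrayList<>();
        for (String line : lines) {
            Matcher m = OMITTED.matcher(line);
            if (m.find()) {
                conflicts.add(new Conflict(m.group(1), m.group(2), m.group(3)));
            }
        }
        return conflicts;
    }

    public static void main(String[] args) {
        // Two of the lines from the output above, used as sample input.
        List<String> sample = List.of(
                "[INFO] | | | +- (org.apache.commons:commons-compress:jar:1.25.0:compile - omitted for conflict with 1.26.1)",
                "[INFO] | | \\- (org.apache.commons:commons-compress:jar:1.26.1:compile - omitted for conflict with 1.24.0)");
        for (Conflict c : parse(sample)) {
            System.out.println(c.artifact() + ": requested " + c.requested()
                    + ", Maven kept " + c.selected());
        }
    }
}
```

Any entry where the kept version is lower than the requested one, like 1.26.1 losing to 1.24.0 here, is a candidate for an explicit pin like the one in the pom above.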