Why won't the following application print the file contents?
package org.example;

import org.apache.tika.Tika;
import java.io.File;

public class TikaFirstTry {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        for (String fileName : args) {
            System.out.println(fileName);
            String text = tika.parseToString(new File(fileName));
            System.out.println("text is: " + text);
        }
    }
}
The file foo.txt contains:
pizzaaaaa
The program output is:
C:/Users/me/Desktop/foo.txt
text is:
No exception is thrown.
My pom contains:
<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-async-cli</artifactId>
        <version>2.7.1-SNAPSHOT</version>
    </dependency>
</dependencies>
These are the relevant dependency sections in pom.xml that are required to run your example:
<project>
    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>org.apache.tika</groupId>
                <artifactId>tika-bom</artifactId>
                <version>2.7.0</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-async-cli</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers-standard-package</artifactId>
        </dependency>
    </dependencies>
</project>
First of all, as @khmarbaise has noticed, your tika-async-cli dependency version looks faulty. As of 26 February, there are only 2 versions of the tika-async-cli artifact available for download: 2.6.0 and 2.7.0. The one you've shared is not on the list, and mvn install throws an error when trying to fetch that version from Maven Central.
You need both the tika-core and tika-parsers-* dependencies to run your example. You've already included tika-core, since tika-async-cli includes it as a direct dependency:
$ mvn dependency:tree
# ...
[INFO] +- org.apache.tika:tika-async-cli:jar:2.7.0:compile
[INFO] | +- org.apache.tika:tika-core:jar:2.7.0:compile
[INFO] | | \- commons-io:commons-io:jar:2.11.0:compile
[INFO] | +- org.apache.logging.log4j:log4j-core:jar:2.19.0:compile
[INFO] | | \- org.apache.logging.log4j:log4j-api:jar:2.19.0:compile
[INFO] | \- org.apache.logging.log4j:log4j-slf4j2-impl:jar:2.19.0:compile
# ...
As @Gagravarr has hinted, one of the tika-parsers-* dependencies was missing from your dependencies section. Currently these come as 3 separate dependencies: tika-parsers-standard-package, tika-parser-scientific-module, and tika-parser-sqlite3-module. As I understand, this came about with Tika 2.0 (more on that here). For your purposes, tika-parsers-standard-package seems sufficient.
The README at https://github.com/apache/tika suggests a Maven configuration, but it is unfortunately incomplete.
I suspect you do not see an exception because Tika falls back to an EmptyParser when no parsers are loaded. It creates an empty XHTML document in the background, and such a document has no text content. Hence your code prints an empty string.
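To illustrate the suspected mechanism, here is a minimal sketch that uses only the JDK, not Tika itself. It assumes parsers are discovered from the classpath in a service-loader style and that an "empty" parser is used when none is found. The names FallbackSketch, TextParser, EmptyTextParser and loadParser are hypothetical stand-ins, not Tika classes or methods.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class FallbackSketch {

    /** Hypothetical stand-in for a parser interface; this is not Tika's Parser. */
    public interface TextParser {
        String parse(String input);
    }

    /** Stand-in for an "empty" fallback: accepts any input and extracts nothing. */
    public static final class EmptyTextParser implements TextParser {
        @Override
        public String parse(String input) {
            return "";
        }
    }

    /**
     * Looks up TextParser implementations registered on the classpath.
     * If none are registered, silently returns the empty fallback instead
     * of throwing, which is the behaviour this sketch is meant to show.
     */
    public static TextParser loadParser() {
        List<TextParser> found = new ArrayList<>();
        for (TextParser parser : ServiceLoader.load(TextParser.class)) {
            found.add(parser);
        }
        return found.isEmpty() ? new EmptyTextParser() : found.get(0);
    }

    public static void main(String[] args) {
        TextParser parser = loadParser();
        // No provider is registered here, so the fallback is chosen.
        System.out.println(parser.getClass().getSimpleName());
        // The fallback ignores its input, so only the prefix is printed.
        System.out.println("text is: " + parser.parse("pizzaaaaa"));
    }
}
```

Run as-is, with no TextParser provider registered, the lookup finds nothing and no exception is raised; the caller just gets empty text. If this matches what Tika does, it would explain why adding a parser artifact to the classpath changes the output without any change to the application code.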