java maven apache-tika

tika.parseToString returns empty string


Why doesn't the following application print the file contents?

package org.example;

import org.apache.tika.Tika;
import java.io.File;

public class TikaFirstTry {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();

        for (String fileName : args){
            System.out.println(fileName);
            String text = tika.parseToString(new File(fileName));
            System.out.println("text is: " + text);
        }
    }
}

The file foo.txt contains:

pizzaaaaa

The program output is:

C:/Users/me/Desktop/foo.txt
text is: 

and no exception is thrown...

My pom.xml contains:

<dependencies>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-async-cli</artifactId>
    <version>2.7.1-SNAPSHOT</version>
  </dependency>
</dependencies>

Solution

  • TL;DR

    These are the pom.xml dependency sections required to run your example:

    <project>
      <dependencyManagement>
        <dependencies>
          <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-bom</artifactId>
            <version>2.7.0</version>
            <type>pom</type>
            <scope>import</scope>
          </dependency>
        </dependencies>
      </dependencyManagement>
    
      <dependencies>
        <dependency>
          <groupId>org.apache.tika</groupId>
          <artifactId>tika-async-cli</artifactId>
        </dependency>
        <dependency>
          <groupId>org.apache.tika</groupId>
          <artifactId>tika-parsers-standard-package</artifactId>
        </dependency>
      </dependencies>
    </project>
    

    Full answer

    First of all, as @khmarbaise has noticed, the version of your tika-async-cli dependency looks wrong. As of 26 February, only two versions of the tika-async-cli artifact are available for download: 2.6.0 and 2.7.0. The one you've specified, 2.7.1-SNAPSHOT, is not on the list, and mvn install fails when it tries to fetch that version from Maven Central.

    You need both tika-core and tika-parsers-* dependencies to run your example.

    You already have tika-core, because tika-async-cli pulls it in as a direct dependency:

    $ mvn dependency:tree
    # ...
    [INFO] +- org.apache.tika:tika-async-cli:jar:2.7.0:compile
    [INFO] |  +- org.apache.tika:tika-core:jar:2.7.0:compile
    [INFO] |  |  \- commons-io:commons-io:jar:2.11.0:compile
    [INFO] |  +- org.apache.logging.log4j:log4j-core:jar:2.19.0:compile       
    [INFO] |  |  \- org.apache.logging.log4j:log4j-api:jar:2.19.0:compile     
    [INFO] |  \- org.apache.logging.log4j:log4j-slf4j2-impl:jar:2.19.0:compile
    # ...
    

    As @Gagravarr has hinted, a tika-parsers-* dependency was missing from your dependencies section. Currently the parsers are split across 3 separate dependencies.

    As I understand, this split came about with Tika 2.0. For your purposes, tika-parsers-standard-package seems sufficient.

    The README at https://github.com/apache/tika suggests a Maven configuration, but it is unfortunately incomplete.

    I suspect you do not see an exception because Tika falls back to an EmptyParser when parsers are not loaded. It creates an empty XHTML document in the background and such a document has no text content. Hence your code outputs an empty string.
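
    To make the suspected fallback concrete, here is a toy model of it in plain Java. It is only a sketch of the idea, not Tika's actual code: ParserFallbackSketch, EMPTY_PARSER, PLAIN_TEXT_PARSER and extractText are hypothetical names invented for this illustration, and the "registry" is just a map from media type to a function.

```java
import java.util.Map;
import java.util.function.UnaryOperator;

// Toy model only: none of these names are Tika classes or Tika APIs.
class ParserFallbackSketch {

    // Stand-in for a "do nothing" parser: accepts any input and emits no text.
    static final UnaryOperator<String> EMPTY_PARSER = content -> "";

    // Stand-in for a real plain-text parser: returns the content unchanged.
    static final UnaryOperator<String> PLAIN_TEXT_PARSER = content -> content;

    // Looks up a parser for the media type and silently falls back to the
    // empty parser when none is registered. No exception is thrown either way.
    static String extractText(Map<String, UnaryOperator<String>> parsers,
                              String mediaType, String content) {
        UnaryOperator<String> parser = parsers.getOrDefault(mediaType, EMPTY_PARSER);
        return parser.apply(content);
    }

    public static void main(String[] args) {
        // Assumed situation in the question: no parser module on the classpath,
        // so the registry is empty and the fallback is used.
        Map<String, UnaryOperator<String>> noParsers = Map.of();
        System.out.println("text is: " + extractText(noParsers, "text/plain", "pizzaaaaa"));

        // Assumed situation after adding a parser package: a text parser is registered.
        Map<String, UnaryOperator<String>> withTextParser =
                Map.of("text/plain", PLAIN_TEXT_PARSER);
        System.out.println("text is: " + extractText(withTextParser, "text/plain", "pizzaaaaa"));
    }
}
```

    The first call prints "text is: " with nothing after it, which matches the output in the question; the second prints "text is: pizzaaaaa". The point of the sketch is that a silent fallback turns a missing dependency into an empty result rather than an error.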