
Issues using the Apache Tika Parser object to parse .doc and .docx file formats


I am trying to use org.apache.tika.parser.Parser with DefaultDetector() to detect and parse .doc and .docx files, but I am getting an error (not an exception) thrown from the Tika jars, and it doesn't come with a helpful stack trace that I could post here. I can confirm that it happens for .doc and .docx only; PDF, JPEG and plain text files are fine. Has anyone come across this problem with .doc and .docx files? Is there a solution that you have adopted?

My code is the following:

unzippedBytes = loadUnzippedByteCode(attachment.getContents()); /* This is utility method written using native Java Zip library - returns byte array byte[] */

            /* All the objects below were declared beforehand, but not initialised until now */

            parseContextObj = new ParseContext();
            dObj = new DefaultDetector();
            detectedParser = new AutoDetectParser(dObj);
            parseContextObj.set(Parser.class, detectedParser);
            OutputStream outputstream = new ByteArrayOutputStream();
            metadata = new Metadata();

            InputStream input = TikaInputStream.get(unzippedBytes, metadata);
            ContentHandler handler = new BodyContentHandler(outputstream);
            detectedParser.parse(input, handler, metadata, parseContextObj); // This is where it throws NoSuchMethodError; I cannot understand why and cannot get the stack trace (using Tika 1.10)
            input.close();

I found the code above in another SO question and decided to use it for my work. The byte[] I am using comes from the very old Struts 1.0 FormFile interface (getFileData(), which returns byte[]). I used to parse with Bullhorn's irex parser, but decided to switch to Tika for numerous reasons. The same byte[] works fine with irex, but fails whenever I try to parse .docx and .doc contents with Tika.

The following is the stack trace, parts of which I have masked for privacy reasons:

2016-01-15 16:21:06,947 [http-apr-80-exec-3] [ERROR] XXXXX.XXXX.XXXXService - java.lang.NoSuchMethodError: org.apache.poi.util.POILogger.log(I[Ljava/lang/Object;)V
        at org.apache.poi.openxml4j.opc.PackageRelationshipCollection.parseRelationshipsPart(PackageRelationshipCollection.java:313)
        at org.apache.poi.openxml4j.opc.PackageRelationshipCollection.<init>(PackageRelationshipCollection.java:163)
        at org.apache.poi.openxml4j.opc.PackageRelationshipCollection.<init>(PackageRelationshipCollection.java:131)
        at org.apache.poi.openxml4j.opc.PackagePart.loadRelationships(PackagePart.java:561)
        at org.apache.poi.openxml4j.opc.PackagePart.<init>(PackagePart.java:109)
        at org.apache.poi.openxml4j.opc.PackagePart.<init>(PackagePart.java:80)
        at org.apache.poi.openxml4j.opc.PackagePart.<init>(PackagePart.java:125)
        at org.apache.poi.openxml4j.opc.ZipPackagePart.<init>(ZipPackagePart.java:78)
        at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:245)
        at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:684)
        at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:227)
        at org.apache.tika.parser.pkg.ZipContainerDetector.detectOPCBased(ZipContainerDetector.java:208)
        at org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:145)
        at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:88)
        at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)

I realised that my classpath has the POI jar version 2.5.1, which according to the Maven Central repo makes me a dinosaur (it seems). Is that possibly why? I am also still getting the error after replacing these with versions 3.13 and 2.6.0 of the POI artifacts and xmlbeans respectively (as suggested by @venkyreddy in that answer).

UPDATE: I built a new project, separate from my original work, with ONLY tika-app-1.10.jar on the classpath. I also inspected tika-app-1.10.jar and found that all the POI dependencies are in there, including xmlbeans and 'xml-schema'. With only tika-app-1.10.jar on the classpath, I get the following Error (not Exception):

java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader
        at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
        at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:158)
        at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:167)
        at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:119)
        at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:59)
        at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:204)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at xxx.xxx.xxx.xxx.xxxxxAttachmentWithTika(xxxService.java:792)

I browsed the package and couldn't find any POIXMLTypeLoader class. Is this a known issue? Could someone please respond to me?


Solution

  • Make sure there are no outdated POI jars on the classpath, and use the version of POI that matches the version of Tika you are trying to use.

    The class POIXMLTypeLoader was added to POI after POI 3.13 was released, so it seems a newer POI jar is being mixed in somewhere. The first release that contains this class is POI 3.14-beta1. Make sure you are not pulling in that version by accident.
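
    One way to find out which jars are actually being mixed is to ask the running JVM where it loads each class from. The following is a minimal sketch, not something that ships with Tika or POI: the class name JarLocator and the methods locate and copies are hypothetical names chosen for illustration, and the class names passed in are simply the ones that appear in the stack traces above. It needs to run inside the same application (same classpath or webapp) that shows the error.

```java
import java.io.IOException;
import java.net.URL;
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper; not part of Tika or POI.
public class JarLocator {

    /** Returns the location (usually a jar URL) the named class is loaded from. */
    public static String locate(String className) {
        try {
            // initialize=false: load the class without running its static initializers.
            Class<?> cls = Class.forName(className, false, JarLocator.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return "(bootstrap or unknown source)";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not loadable: " + e + ")";
        }
    }

    /** Lists every copy of the class file that the class loader can see. */
    public static List<String> copies(String className) {
        String resource = className.replace('.', '/') + ".class";
        List<String> found = new ArrayList<>();
        try {
            ClassLoader loader = JarLocator.class.getClassLoader();
            for (URL url : Collections.list(loader.getResources(resource))) {
                found.add(url.toString());
            }
        } catch (IOException e) {
            found.add("(lookup failed: " + e + ")");
        }
        return found;
    }

    public static void main(String[] args) {
        // Class names taken from the two stack traces in the question.
        String[] names = {
            "org.apache.poi.util.POILogger",
            "org.apache.poi.POIXMLTypeLoader",
            "org.apache.poi.xwpf.usermodel.XWPFDocument",
            "org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument",
            "org.apache.tika.parser.AutoDetectParser"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
            for (String copy : copies(name)) {
                System.out.println("    copy: " + copy);
            }
        }
    }
}
```

    If a POI class resolves to an old jar such as the 2.5.1 one, or if copies reports the same class in more than one jar, that jar is a likely source of the NoSuchMethodError or NoClassDefFoundError and is the one to remove or align with the Tika version.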