Tags: solr, apache-poi, pdfbox, apache-tika, content-indexing

Cannot index PDF files after updating PDFBox from 1.8 to 2.0.2


I am using PDFBox and Tika for content indexing of PDF files. Everything works fine with PDFBox 1.8, but after I updated PDFBox to 2.0.2 I get the error below:

(Thread-62 (HornetQ-client-global-threads-2071379348)) Exception while creating solr doucment for content::Failed to close temporary resources: org.apache.tika.exception.TikaException: Failed to close temporary resources
at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:152)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:149)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)

at org.hornetq.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:91)
at org.hornetq.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:983)
at org.hornetq.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:48)
at org.hornetq.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1113)
at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:100)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Could not delete temporary file C:\Users\FILESE~1\AppData\Local\Temp\apache-tika-7918716906396425097.tmp
at org.apache.tika.io.TemporaryResources$1.close(TemporaryResources.java:70)
at org.apache.tika.io.TemporaryResources.close(TemporaryResources.java:121)
at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:150)
... 18 more

Can you please help me to resolve this issue?

I updated PDFBox to 2.0.2 because of this.

My Gradle dependencies are:

compile "org.apache.poi:poi:3.8"
compile "org.apache.poi:poi-ooxml:3.8"
compile "org.apache.poi:poi-scratchpad:3.8"
compile "org.apache.pdfbox:pdfbox:2.0.2"

compile 'org.apache.tika:tika-parsers:1.5'
compile 'org.apache.tika:tika-core:1.5'

I am using Tika 1.5, and this version supports PDFBox 2.0.3; you can see that here.


Solution

  • You use Tika version 1.5 and claim

    Tika 1.5 supports pdfbox 2.0.3

    This is extremely implausible, considering that Tika 1.5 was released in February 2014, long before any PDFBox 2.x version existed, and that PDFBox 2.0.0 is incompatible with the earlier 1.8.x releases in multiple ways.

    You point towards the mvnrepository page for Apache Tika Parsers » 1.5 to support your claim. This page shows:

    Screenshot

    But all this means is that Tika 1.5 has a dependency on PDFBox 1.8.4 and that there now exists a PDFBox version 2.0.3. It does not mean that Tika 1.5 properly functions with PDFBox 2.0.3.

    Looking at the pom file you'll see:

    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>1.8.4</version>
    </dependency>
    

    Thus, Tika 1.5 was developed and compiled against PDFBox 1.8.4. If the PDFBox version numbering is sensible, you can hope that Tika 1.5 works properly with any PDFBox 1.8.x for x >= 4.

    But the PDFBox developers took the opportunity to overhaul the PDFBox architecture in the 2.0.0 release. Most likely, therefore, no program depending on a 1.x PDFBox version can function with PDFBox 2.x without changes.

    According to the Tika issue TIKA-1959, Tika has been able to run with PDFBox 2.0.1 since version 1.13.


    To make a long story short, you need at least Tika version 1.13 if you want to use Tika with PDFBox 2.0.x.
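
    As a rough sketch of what that could look like in the Gradle file from the question (an untested assumption, not a verified configuration): raise both Tika artifacts to 1.13 and drop the explicit pdfbox line, so that tika-parsers pulls in the PDFBox version it was built against. The version number is only the minimum named above; a later Tika release should follow the same pattern.

    ```groovy
    dependencies {
        // Assumption: 1.13 is the minimum Tika version per TIKA-1959;
        // a newer release may be preferable.
        compile 'org.apache.tika:tika-core:1.13'
        compile 'org.apache.tika:tika-parsers:1.13'

        // No explicit org.apache.pdfbox:pdfbox entry: tika-parsers declares its
        // own PDFBox dependency, so Gradle resolves a version Tika was built against.
        // (The POI dependencies from the question are omitted here for brevity.)
    }
    ```

    Running `gradle dependencyInsight --dependency pdfbox --configuration compile` afterwards should show which PDFBox version actually ends up on the classpath.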