java apache-tika lucee

Why am I unable to extract text via Apache Tika using Lucee?


I would like to extract text from PDF, DOCX and similar files via Lucee 5+ (5.2.9), but unfortunately I get an empty result. I have tried several Apache Tika versions (the runnable jar, with Java 1.8.0) that should fit my Lucee and Java setup, but the result is always empty.

exract.cfc

component {
    
    public any function init() {

        _setTikaJarPath( GetDirectoryFromPath( GetCurrentTemplatePath( ) ) & "tika-app-1.19.1.jar" );

        return this;

    }


    private struct function doParse( required any fileContent, boolean includeMeta=true, boolean includeText=true ) {
        var result  = {};
        var is      = "";
        var jarPath = _getTikaJarPath();

        if ( IsBinary( arguments.fileContent ) ) {
            is = CreateObject( "java", "java.io.ByteArrayInputStream" ).init( arguments.fileContent );
        } else {
            // TODO, support plain string input (i.e. html)
            return {};
        }

        try {
            var parser = CreateObject( "java", "org.apache.tika.parser.AutoDetectParser", jarPath );
            var ch     = CreateObject( "java", "org.apache.tika.sax.BodyContentHandler" , jarPath ).init(-1);
            var md     = CreateObject( "java", "org.apache.tika.metadata.Metadata"      , jarPath ).init();

            parser.parse( is, ch, md );

            if ( arguments.includeMeta ) {
                result.metadata = {};

                for( var key in md.names() ) {
                    var mdval = md.get( key );
                    if ( !isNull( mdval ) ) {
                        result.metadata[ key ] = _removeNonUnicodeChars( mdval );
                    }
                }
            }

            if ( arguments.includeText ) {
                result.text = _removeNonUnicodeChars( ch.toString() );
            }

        } catch( any e ) {
            result = { error = e };
        }

        return result;
    }


    public function read(required string filename) {
        var result = {};

        if(!fileExists(filename)) {
            result.error = "#filename# does not exist.";
            return result;
        };

        var f = createObject("java", "java.io.File").init(filename);
        var fis = createObject("java","java.io.FileInputStream").init(f);

        try {
            result = doParse(fis);
        } catch(any e) {
            result.error = e;
        }
        fis.close();
        return result;
    }

    private string function _removeNonUnicodeChars( required string potentiallyDirtyString ) {
        return ReReplace( arguments.potentiallyDirtyString, "[^\x20-\x7E]", "", "all" );
    }

// GETTERS AND SETTERS
    private string function _getTikaJarPath() {
        return _tikaJarPath;
    }
    private void function _setTikaJarPath( required string tikaJarPath ) {
        _tikaJarPath = arguments.tikaJarPath;
    }


}

And the code that I use to run it:

<cfset takis = new exract()>
<cfset files = directoryList(expandPath("./sources"))>
<cfloop index="f" array="#files#">
    <cfif not findNoCase(".DS_Store",f)>
        <cfdump var="#takis.read(f)#" label="#f#">
    </cfif>
</cfloop>



Solution

  • I think the problem is a class clash: the Lucee core engine already loads its own version of Tika, so the jar you point to is ignored. The version that is loaded doesn't behave as expected and returns empty strings, as you've seen.
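
One way to check the clash theory is to ask the JVM where a class actually came from. The following is a minimal Java sketch, not something taken from Lucee or Tika: `ClassOrigin` and `describe` are hypothetical names, and it relies only on the standard `getProtectionDomain().getCodeSource()` calls. It has to run inside the same JVM as Lucee to say anything about Lucee's class loading; the `main` method is only there so the sketch runs on its own.

```java
import java.security.CodeSource;

/** Reports which jar or directory a class was actually loaded from. */
class ClassOrigin {

    /** Returns the code source location, or a note when the class comes from the JVM itself. */
    static String describe(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            // Core JDK classes are loaded by the bootstrap loader and have no code source
            return "bootstrap/JDK (no code source)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a class name such as org.apache.tika.Tika as the first argument
        String name = args.length > 0 ? args[0] : "java.lang.String";
        Class<?> type = Class.forName(name);
        System.out.println(name + " <- " + describe(type));
        System.out.println("loader: " + type.getClassLoader());
    }
}
```

If the location reported for org.apache.tika.Tika is not the jar you supplied, that would point towards the engine's own copy being used.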

    I've solved this by using OSGi to load the desired Tika version. This involves editing the manifest of the tika-app jar to include basic OSGi metadata and then loading it via my osgiLoader.

    There is a pre-built Tika bundle available, but I haven't been able to get it to work with Lucee.

    Here's how to convert the latest tika-app jar to OSGi:

    1. Open "tika-app-1.28.2.jar" with 7-Zip.
    2. Open META-INF, then select MANIFEST.MF and press F4 to open it in a text editor.
    3. Add the following to the end of the file:

    Bundle-Name: Apache Tika App Bundle
    Bundle-SymbolicName: apache-tika-app-bundle
    Bundle-Description: Apache Tika App jar converted to an OSGi bundle
    Bundle-ManifestVersion: 2
    Bundle-Version: 1.28.2
    Bundle-ClassPath: .,tika-app-1.28.2.jar

    4. Save, choosing to update the archive when prompted.
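
If you would rather not edit the manifest by hand, the same headers can be set with the JDK's `java.util.jar` classes. This is only a sketch of the header step: `OsgiManifestSketch` and `addBundleHeaders` are hypothetical names, the header values are copied from the steps above, and writing the result back into the jar is deliberately left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

class OsgiManifestSketch {

    /** Adds the OSGi headers listed in the steps above to an existing manifest. */
    static Manifest addBundleHeaders(Manifest manifest, String version) {
        Attributes main = manifest.getMainAttributes();
        // Manifest.write() skips the main attributes when no version header is set
        if (main.getValue(Attributes.Name.MANIFEST_VERSION) == null) {
            main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        }
        main.putValue("Bundle-Name", "Apache Tika App Bundle");
        main.putValue("Bundle-SymbolicName", "apache-tika-app-bundle");
        main.putValue("Bundle-Description", "Apache Tika App jar converted to an OSGi bundle");
        main.putValue("Bundle-ManifestVersion", "2");
        main.putValue("Bundle-Version", version);
        main.putValue("Bundle-ClassPath", ".,tika-app-" + version + ".jar");
        return manifest;
    }

    public static void main(String[] args) throws IOException {
        // Starts from an empty manifest for illustration; a real run would read the jar's own
        Manifest manifest = addBundleHeaders(new Manifest(), "1.28.2");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        manifest.write(out);
        System.out.print(out.toString("UTF-8"));
    }
}
```

In practice you would read the existing MANIFEST.MF from the jar into the `Manifest` first, so that the original entries are kept alongside the new ones.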

    You can then call the jar using osgiLoader as follows:

    extractor.cfc

    component{
    
        property name="loader" type="object";
        property name="tikaBundle" type="struct";
    
        public extractor function init( required object loader, required struct tikaBundle ){
            variables.loader = arguments.loader
            variables.tikaBundle = arguments.tikaBundle
            return this
        }
    
        public string function parseToString( required string filePath ){
            // Open the stream before the try block so that "finally" never touches an undefined variable
            var fileStream = CreateObject( "java", "java.io.FileInputStream" ).init( JavaCast( "string", arguments.filePath ) )
            try{
                // Load Tika from the OSGi bundle rather than from the version bundled with Lucee
                var tikaObject = loader.loadClass( "org.apache.tika.Tika", tikaBundle.path, tikaBundle.name, tikaBundle.version )
                var result = tikaObject.parseToString( fileStream )
            }
            finally{
                fileStream.close()
            }
            return result
        }
    
    }
    

    (The following script assumes extractor.cfc, the modified Tika jar, the osgiLoader.cfc and the document to be processed are in the same directory.)

    index.cfm

    <cfscript>
    docPath = ExpandPath( "test.pdf" )
    loader = New osgiLoader()
    tikaBundle = {
        version: "1.28.2"
        ,name: "apache-tika-app-bundle"
        ,path: ExpandPath( "tika-app-1.28.2.jar" )
    }
    extractor = New extractor( loader, tikaBundle )
    result = extractor.parseToString( docPath )
    dump( result )
    </cfscript>
    

    Another way to get the right version loaded is to use JavaLoader. For some reason I couldn't get it to work with the latest tika-app jar (1.28.2), but 1.19.1 does seem to work.
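
The general idea behind loading a jar in isolation can be sketched in plain Java with a `URLClassLoader` that has no application parent. This is not JavaLoader's implementation, only an illustration of the technique; `IsolatedJarLoader` and `forJar` are hypothetical names, and a real Tika jar may need more of the JDK visible than a null parent provides.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.nio.file.Paths;

class IsolatedJarLoader {

    /** Creates a loader that sees the given jar and core JDK classes, but not the application's classpath. */
    static URLClassLoader forJar(Path jar) {
        try {
            URL[] urls = { jar.toUri().toURL() };
            // A null parent means only the bootstrap loader is consulted before the jar
            return new URLClassLoader(urls, null);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a usable jar path: " + jar, e);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: java IsolatedJarLoader <jar> <class name>");
            return;
        }
        // For example: tika-app-1.19.1.jar org.apache.tika.Tika
        try (URLClassLoader loader = forJar(Paths.get(args[0]))) {
            Class<?> type = loader.loadClass(args[1]);
            System.out.println(type.getName() + " loaded by " + type.getClassLoader());
        }
    }
}
```

Because the loader does not delegate to the application class loader, a class of the same name already loaded by the host cannot shadow the one in the jar.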

    Hacking the existing extension

    I would advise you to raise an issue with Preside to change their extension to avoid the clash, but as a temporary hack you could try amending it yourself as follows:

    First, add your modified Tika bundle and the osgiLoader.cfc to the /preside-ext-tika/services/ directory.

    Next, change line 14 of DocumentMetadataService.cfc so that the Tika jar file name matches your modified bundle:

    _setTikaJarPath( GetDirectoryFromPath( GetCurrentTemplatePath( ) ) & "tika-app-1.28.2.jar" );
    

    Then, modify lines 33-35 of the same cfc to replace:

    var parser = CreateObject( "java", "org.apache.tika.parser.AutoDetectParser", jarPath );
    var ch     = CreateObject( "java", "org.apache.tika.sax.BodyContentHandler" , jarPath ).init(-1);
    var md     = CreateObject( "java", "org.apache.tika.metadata.Metadata"      , jarPath ).init();
    

    with the following:

    var loader = New osgiLoader();
    var tikaBundle = { version: "1.28.2", name: "apache-tika-app-bundle" };
    
    var parser = loader.loadClass( "org.apache.tika.parser.AutoDetectParser", jarPath, tikaBundle.name, tikaBundle.version )
    var ch     = loader.loadClass( "org.apache.tika.sax.BodyContentHandler" , jarPath, tikaBundle.name, tikaBundle.version ).init(-1)
    var md     = loader.loadClass( "org.apache.tika.metadata.Metadata"      , jarPath, tikaBundle.name, tikaBundle.version ).init()
    

    NB: I don't have Preside so can't test it in context.