I would like to extract text from pdf, docx etc via Lucee 5+ (5.2.9), but unfortunately I get empty result set. I have used several Apache Tika versions (runnable jar with Java 1.8.0) that might fit to my specific Lucee and Java requirements, but the result set always remains empty.
exract.cfc
component {
public any function init() {
_setTikaJarPath( GetDirectoryFromPath( GetCurrentTemplatePath( ) ) & "tika-app-1.19.1.jar" );
return this;
}
private struct function doParse( required any fileContent, boolean includeMeta=true, boolean includeText=true ) {
var result = {};
var is = "";
var jarPath = _getTikaJarPath();
if ( IsBinary( arguments.fileContent ) ) {
is = CreateObject( "java", "java.io.ByteArrayInputStream" ).init( arguments.fileContent );
} else {
// TODO, support plain string input (i.e. html)
return {};
}
try {
var parser = CreateObject( "java", "org.apache.tika.parser.AutoDetectParser", jarPath );
var ch = CreateObject( "java", "org.apache.tika.sax.BodyContentHandler" , jarPath ).init(-1);
var md = CreateObject( "java", "org.apache.tika.metadata.Metadata" , jarPath ).init();
parser.parse( is, ch, md );
if ( arguments.includeMeta ) {
result.metadata = {};
for( var key in md.names() ) {
var mdval = md.get( key );
if ( !isNull( mdval ) ) {
result.metadata[ key ] = _removeNonUnicodeChars( mdval );
}
}
}
if ( arguments.includeText ) {
result.text = _removeNonUnicodeChars( ch.toString() );
}
} catch( any e ) {
result = { error = e };
}
return result;
}
public function read(required string filename) {
var result = {};
if(!fileExists(filename)) {
result.error = "#filename# does not exist.";
return result;
};
var f = createObject("java", "java.io.File").init(filename);
var fis = createObject("java","java.io.FileInputStream").init(f);
try {
result = doParse(fis);
} catch(any e) {
result.error = e;
}
fis.close();
return result;
}
private string function _removeNonUnicodeChars( required string potentiallyDirtyString ) {
return ReReplace( arguments.potentiallyDirtyString, "[^\x20-\x7E]", "", "all" );
}
// GETTERS AND SETTERS
private string function _getTikaJarPath() {
return _tikaJarPath;
}
private void function _setTikaJarPath( required string tikaJarPath ) {
_tikaJarPath = arguments.tikaJarPath;
}
}
and the code that i use to run it
<cfset takis = new exract()>
<cfset files = directoryList(expandPath("./sources"))>
<cfloop index="f" array="#files#">
<cfif not findNoCase(".DS_Store",f)>
<cfdump var="#takis.read(f)#" label="#f#">
</cfif>
</cfloop>
I think the problem is a class clash: The Lucee core engine already loads a version of Tika meaning the one you point to is ignored. But the loaded version doesn't behave as expected, returning empty strings as you've seen.
I've solved this by using OSGi to load the desired Tika version. This involves editing the Manifest of the tika-app jar to include basic OSGi metadata and then loading it via my osgiLoader
There is a pre-built Tika bundle available but I haven't been able to get it to work with Lucee.
Here's how to convert the latest tika-app
jar to OSGi:
Bundle-Name: Apache Tika App Bundle
Bundle-SymbolicName: apache-tika-app-bundle
Bundle-Description: Apache Tika App jar converted to an OSGi bundle
Bundle-ManifestVersion: 2
Bundle-Version: 1.28.2
Bundle-ClassPath: .,tika-app-1.28.2.jar
You can then call the jar using osgiLoader as follows:
extractor.cfc
component{
property name="loader" type="object";
property name="tikaBundle" type="struct";
public extractor function init( required object loader, required struct tikaBundle ){
variables.loader = arguments.loader
variables.tikaBundle = arguments.tikaBundle
return this
}
public string function parseToString( required string filePath ){
try{
var fileStream = CreateObject( "java", "java.io.FileInputStream" ).init( JavaCast( "string", arguments.filePath ) )
var tikaObject = loader.loadClass( "org.apache.tika.Tika", tikaBundle.path, tikaBundle.name, tikaBundle.version )
var result = tikaObject.parseToString( fileStream )
}
finally{
fileStream.close()
}
return result
}
}
(The following script assumes extractor.cfc
, the modified Tika jar, the osgiLoader.cfc
and the document to be processed are in the same directory.)
index.cfm
<cfscript>
docPath = ExpandPath( "test.pdf" )
loader = New osgiLoader()
tikaBundle = {
version: "1.28.2"
,name: "apache-tika-app-bundle"
,path: ExpandPath( "tika-app-1.28.2.jar" )
}
extractor = New extractor( loader, tikaBundle )
result = extractor.parseToString( docPath )
dump( result )
</cfscript>
Another way to get the right version loaded is to use JavaLoader. For some reason I couldn't get it to work with the latest tika-app
jar (1.28.2
), but 1.19.1
does seem to work.
I would advise you to raise an issue with Preside to change their extension to avoid the clash, but as a temporary hack you could try amending it yourself as follows:
First, add your modified Tika bundle and the osgiLoader.cfc
to the /preside-ext-tika/services/
directory.
Next, change line 14 of DocumentMetadataService.cfc
so the name of the Tika jar path matches your modified bundle.
_setTikaJarPath( GetDirectoryFromPath( GetCurrentTemplatePath( ) ) & "tika-app-1.28.2.jar" );
Then, modify lines 33-35 of the same cfc to replace:
var parser = CreateObject( "java", "org.apache.tika.parser.AutoDetectParser", jarPath );
var ch = CreateObject( "java", "org.apache.tika.sax.BodyContentHandler" , jarPath ).init(-1);
var md = CreateObject( "java", "org.apache.tika.metadata.Metadata" , jarPath ).init();
with the following:
var loader = New osgiLoader();
var tikaBundle = { version: "1.28.2", name: "apache-tika-app-bundle" };
var parser = loader.loadClass( "org.apache.tika.parser.AutoDetectParser", jarPath, tikaBundle.name, tikaBundle.version )
var ch = loader.loadClass( "org.apache.tika.sax.BodyContentHandler" , jarPath, tikaBundle.name, tikaBundle.version ).init(-1)
var md = loader.loadClass( "org.apache.tika.metadata.Metadata" , jarPath, tikaBundle.name, tikaBundle.version ).init()
NB: I don't have Preside so can't test it in context.