[SOLVED] Extract text from a PDF in Railo

I've just taken over coding a Railo site (Railo 3.3.4.003) and I want to index a large number of PDFs. However, cfindex only seems to index text documents. I see there is <cfpdf action="extracttext">, but apparently this is not supported in Railo. Can anyone confirm this either way? If it isn't supported, is org.apache.pdfbox the best option?

Solution

PDFBox will certainly do the job. There's an old version included in the Railo class path, but I found it to be buggy. Instead I would use JavaLoader to load the latest version.

pdfTextExtractor.cfc

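/* The pre-built standalone PDFBox jar file (pdfbox-app-1.8.11.jar here) and the javaloader package are assumed to be in the same folder as the following component. The class names used are those of the PDFBox 1.8.x line; PDFBox 2.x moved PDFTextStripper to org.apache.pdfbox.text. */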
component{

    function init( javaLoaderPath="javaloader.JavaLoader" ){
        if( !server.KeyExists( "_pdfBoxLoader" ) ){
            var paths=[];
            paths.append( GetDirectoryFromPath( GetCurrentTemplatePath() ) & "pdfbox-app-1.8.11.jar" );
            server._pdfBoxLoader=New "#javaLoaderPath#"( paths );
        }
        variables.reader=server._pdfBoxLoader.create( "org.apache.pdfbox.pdmodel.PDDocument" );
        variables.stripper=server._pdfBoxLoader.create( "org.apache.pdfbox.util.PDFTextStripper" );
        return this;
    }

    string function extractText( required string pdfPath, numeric startPage=0, numeric endPage=0 ){
        if( Val( startPage ) )
            stripper.setStartPage( startPage );
        if( Val( endPage ) )
            stripper.setEndPage( endPage );
        var pdf=reader.load( pdfPath );
        var text=stripper.getText( pdf );
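        // Close the loaded document itself; reader is only the PDDocument class reference used for the static load() call
        pdf.close();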
        return text;
    }

}
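
To tie this back to the original indexing question, here is a minimal usage sketch rather than a tested recipe. It assumes the component is saved as pdfTextExtractor.cfc alongside the calling template, that docs/example.pdf is a placeholder path, that a search collection named pdfDocs (a hypothetical name) has already been created with cfcollection, and that Railo's cfindex accepts type="custom" with a literal body in the same way Adobe ColdFusion does.

```cfml
<!--- Assumption: pdfTextExtractor.cfc, the javaloader package and the PDFBox jar sit alongside this template --->
<cfset extractor = New pdfTextExtractor()>

<!--- Placeholder path: substitute a real PDF on your server --->
<cfset pdfPath = ExpandPath( "docs/example.pdf" )>

<!--- Extract the text of every page --->
<cfset pdfText = extractor.extractText( pdfPath )>

<!--- Assumption: the "pdfDocs" collection already exists; index the extracted text as a custom record keyed by file path --->
<cfindex action="update" collection="pdfDocs" type="custom" key="#pdfPath#" title="#GetFileFromPath( pdfPath )#" body="#pdfText#">
```

To limit extraction to a page range, pass startPage and endPage, for example extractor.extractText( pdfPath=pdfPath, startPage=1, endPage=3 ). Note that the component only calls setStartPage() and setEndPage() when a non-zero value is passed, so a range set on one call stays in effect on the same instance for later calls; creating a fresh instance per document avoids surprises.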

See http://blog.simplicityweb.co.uk/94/migrating-from-coldfusion-to-railo-part-7-pdfs for more detail.

The above will also work with Lucee, Railo's successor, to which I'd strongly advise migrating.