pdf indexing lucene cfml railo

Extract text from a PDF in Railo


I've just taken over coding a Railo site (Railo 3.3.4.003) and I want to index a large number of PDFs. However, cfindex only seems to index text docs. I see there is <cfpdf action="extracttext">, but apparently this is not supported in Railo. Can anyone confirm or otherwise? If not, is the best option org.apache.pdfbox?


Solution

  • PDFBox will certainly do the job. There's an old version included in the Railo class path, but I found it to be buggy. Instead I would use JavaLoader to load the latest version.

    pdfTextExtractor.cfc

    /* The latest pre-built standalone PDFBox jar file and the javaloader package are assumed to be in the same folder as the following component */
    component{
    
        function init( javaLoaderPath="javaloader.JavaLoader" ){
            if( !server.KeyExists( "_pdfBoxLoader" ) ){
                var paths=[];
                paths.append( GetDirectoryFromPath( GetCurrentTemplatePath() ) & "pdfbox-app-1.8.11.jar" );
                server._pdfBoxLoader=New "#javaLoaderPath#"( paths );
            }
            variables.reader=server._pdfBoxLoader.create( "org.apache.pdfbox.pdmodel.PDDocument" );
            variables.stripper=server._pdfBoxLoader.create( "org.apache.pdfbox.util.PDFTextStripper" );
            return this;
        }
    
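        string function extractText( required string pdfPath, numeric startPage=0, numeric endPage=0 ){
            // The stripper is reused between calls, so always set the page range explicitly (PDFBox defaults: 1 to Integer.MAX_VALUE)
            stripper.setStartPage( Val( startPage ) ? startPage : 1 );
            stripper.setEndPage( Val( endPage ) ? endPage : 2147483647 );
            var pdf=reader.load( pdfPath );
            try{
                return stripper.getText( pdf );
            }
            finally{
                // Close the loaded document itself, not the PDDocument class reference
                pdf.close();
            }
        }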
    
    }
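
    As a rough sketch of how the component might be used for the indexing described in the question, the template below extracts the text of each PDF in a folder and passes it to cfindex as a custom entry. The file name index.cfm, the pdfs folder and the pdfDocs collection are all hypothetical. The collection is assumed to have been created already with cfcollection, and the template is assumed to sit in the same folder as pdfTextExtractor.cfc.

    index.cfm

    <!--- Hypothetical example: folder, collection and variable names are assumptions --->
    <cfset extractor = New pdfTextExtractor()>
    <cfset pdfDir = ExpandPath( "pdfs" )>
    <cfdirectory action="list" directory="#pdfDir#" filter="*.pdf" name="pdfFiles">
    <cfloop query="pdfFiles">
        <cfset pdfPath = pdfFiles.directory & "/" & pdfFiles.name>
        <cfset pdfText = extractor.extractText( pdfPath )>
        <!--- type="custom" indexes the string passed in body and stores it under key --->
        <cfindex action="update" collection="pdfDocs" type="custom" key="#pdfPath#" title="#pdfFiles.name#" body="#pdfText#">
    </cfloop>

    Creating the extractor once and reusing it inside the loop avoids repeating the loader setup for every file. For a large batch it may also be worth wrapping the body of the loop in cftry/cfcatch so that one damaged PDF does not stop the whole run.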
    

    See http://blog.simplicityweb.co.uk/94/migrating-from-coldfusion-to-railo-part-7-pdfs for more detail.

    The above will also work with Lucee, Railo's successor, to which I'd strongly advise migrating.