
Solr - Achieve Delta-Import with the FileListEntityProcessor for PDF Files


Solr version: 6.6.1

I am using Solr to index PDF files, and it is working as expected. I now have a requirement to perform a delta-import on the PDF files: only files that were recently added to the folder should be processed when the DataImportHandler runs.

I have not been able to find an example of implementing delta-import with FileListEntityProcessor.

Any suggestions would be appreciated.

My data-config.xml file looks like this:

<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <entity name="K1FileEntity" processor="FileListEntityProcessor"
            dataSource="null"
            recursive="true"
            baseDir="\\CLD02\RemoteDepot"
            fileName=".*pdf" rootEntity="false">

            <field column="file" name="id"/>
            <!--<field column="fileAbsolutePath" name="path" />
            <field column="fileSize" name="size" />-->
            <field column="fileLastModified" name="lastmodified" />

              <entity name="pdf" processor="TikaEntityProcessor"
                      onError="skip"
                      url="${K1FileEntity.fileAbsolutePath}" format="text">

                <field column="title" name="title" meta="true"/>
                <field column="dc:format" name="format" meta="true"/>
                <field column="text" name="text"/>

              </entity>
    </entity>
  </document>
</dataConfig> 

Solution

  • As mentioned in the docs:

    delta-import

    For incremental imports and change detection. Only the SqlEntityProcessor supports delta imports.
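
    In other words, the delta-import command itself will not pick up new files with a FileListEntityProcessor configuration. For comparison, here is a minimal sketch of a commonly used workaround that is not part of the quoted documentation: filter the file list by modification time with the processor's newerThan attribute. It assumes that ${dataimporter.last_index_time} has been populated by a previous import run and that the behaviour is the same on 6.6.1, which should be verified. Only the outer entity from the question's config is shown, and the nested Tika entity stays unchanged.

    ```xml
    <!-- Sketch only: same outer entity as in the question, plus newerThan. -->
    <!-- Assumption: ${dataimporter.last_index_time} resolves to the start  -->
    <!-- time of the last import in "yyyy-MM-dd HH:mm:ss" format.           -->
    <entity name="K1FileEntity" processor="FileListEntityProcessor"
            dataSource="null"
            recursive="true"
            baseDir="\\CLD02\RemoteDepot"
            fileName=".*pdf" rootEntity="false"
            newerThan="${dataimporter.last_index_time}">
      <field column="file" name="id"/>
      <field column="fileLastModified" name="lastmodified"/>
      <!-- nested TikaEntityProcessor entity goes here, unchanged -->
    </entity>
    ```

    With this approach the import would be run as command=full-import with clean=false, so that documents already in the index are kept and only files modified since the last run are re-read. Note that files deleted from the folder are not removed from the index this way.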

    So you would need to either: