solrdih

How can I get the file name using FileListEntityProcessor and LineEntityProcessor?


This is my data-config.xml. I can't use TikaEntityProcessor. Is there any way I can do it with LineEntityProcessor?

I am using Solr 4.4 to index millions of documents. I want the file names and modification times to be indexed as well, but I could not find a way to do it. In data-config.xml I fetch the files using FileListEntityProcessor and then parse each line using LineEntityProcessor.

<dataConfig>
    <dataSource encoding="UTF-8" type="FileDataSource" name="fds" />

    <document>
        <entity name="files"
                dataSource="null"
                rootEntity="false"
                processor="FileListEntityProcessor"
                baseDir="C:/Softwares/PlafFiles/"
                fileName=".*\.PLF"
                recursive="true">
            <field column="fileLastModified" name="last_modified" />

            <entity name="na_04"
                    processor="LineEntityProcessor"
                    dataSource="fds"
                    url="${files.fileAbsolutePath}"
                    transformer="script:parseRow23">
                <field column="url" name="Plaf_filename" />
                <field column="source" />
                <field column="pict_id" name="pict_id" />
                <field column="pict_type" name="pict_type" />
                <field column="hierarchy_id" name="hierarchy_id" />
                <field column="book_id" name="book_id" />
                <field column="ciscode" name="ciscode" />
                <field column="plaf_line" />
            </entity>
        </entity>
    </document>
</dataConfig>

Solution

  • From the documentation of FileListEntityProcessor:

    The implicit fields generated by the FileListEntityProcessor are fileDir, file, fileAbsolutePath, fileSize, fileLastModified and these are available for use within the entity [..].

    You can move these values into differently named fields by referencing them:

    <field column="file" name="filenamefield" />
    <field column="fileLastModified" name="last_modified" />
    

    This requires that your schema.xml defines fields with those two names.
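
    As a rough sketch, the matching schema.xml entries could look like the following. The field names come from the two mappings above; the types `string` and `date` are assumptions and must match field types that are actually defined in your schema.

    ```xml
    <!-- schema.xml: names follow the <field> mappings above; the types are assumed -->
    <field name="filenamefield" type="string" indexed="true" stored="true" />
    <field name="last_modified" type="date" indexed="true" stored="true" />
    ```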

    If you need to embed them in another string or manipulate them further before inserting:

    You're already using files.fileAbsolutePath, so by using ${files.file} and ${files.fileLastModified} you should be able to extract the values you want.

    You can modify these values and insert them into a specific field by adding TemplateTransformer to the entity's transformer attribute and referencing the generated fields:

    <field column="filename" template="file:///${files.file}" />
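
    Applied to the inner entity from the question, this might look roughly like the sketch below. It assumes the existing `parseRow23` script transformer can simply be chained with TemplateTransformer in a comma-separated list, and that `Plaf_filename` and `last_modified` exist in schema.xml. The `column="url"` mapping from the question is replaced by a template field here.

    ```xml
    <!-- Inner entity from the question; only the transformer list and the two template fields are new -->
    <entity name="na_04"
            processor="LineEntityProcessor"
            dataSource="fds"
            url="${files.fileAbsolutePath}"
            transformer="script:parseRow23,TemplateTransformer">
        <!-- File name without its directory, taken from the outer "files" entity -->
        <field column="Plaf_filename" template="${files.file}" />
        <!-- Modification time of the file, also from the outer entity -->
        <field column="last_modified" template="${files.fileLastModified}" />
        <!-- Remaining fields stay as in the question -->
        <field column="plaf_line" />
    </entity>
    ```

    If last_modified is already mapped in the outer entity, as in the question, the second template field is probably redundant and only one of the two mappings should be needed.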