solr apache-tika dataimporthandler

Is there a way for the Solr data import handler to get metadata from an RDBMS and the related file content from Tika?


I intend to use Solr's data import handler to create documents from RDBMS records. One of the RDBMS columns holds the path to a PDF or Word file. I would like to parse that file with Tika and store the extracted text in another field of the same document, so that each final document contains both the RDBMS data and the Tika output.

For example

Document fields from db: author, publish_year, e-mail

Document fields from tika: plain_text

Is this possible with a single document configuration in the data import handler, or should I run separate imports (SQL and Tika as separate document types) and then join the results in my queries?


Solution

  • Yes, it is. After some trial and error, the following configuration works:

    <dataConfig>
        <!-- JDBC data source for the metadata rows -->
        <dataSource name="ds-db" driver="org.mariadb.jdbc.Driver" url="jdbc:mysql://localhost:3306/eepyakm?user=root" user="root" password="root"/>
        <!-- Binary file data source that feeds file contents to Tika -->
        <dataSource name="ds-file" type="BinFileDataSource"/>
        <document>
            <!-- Root entity: one Solr document per supplier row -->
            <entity name="supplier" query="select * from suppliers_tmp_view" dataSource="ds-db"
                    deltaQuery="select id from suppliers_tmp_view where last_modified > '${dataimporter.last_index_time}'"
                    deltaImportQuery="select * from suppliers_tmp_view where id='${dataimporter.delta.id}'">

                <!-- Child entity: file paths that belong to the current supplier -->
                <entity name="attachment" dataSource="ds-db"
                        query="select * from suppliers_tmp_files_view where supplier_tmp_id='${supplier.id}' and path is not null"
                        deltaQuery="select id,supplier_tmp_id from suppliers_tmp_files_view where last_modified > '${dataimporter.last_index_time}' and path is not null"
                        parentDeltaQuery="select id from suppliers_tmp_view where id='${attachment.supplier_tmp_id}'">

                    <field name="path" column="path"/>

                    <!-- Innermost entity: Tika parses the file at the path; files that fail are skipped -->
                    <entity name="file" onError="skip" processor="TikaEntityProcessor" url="${attachment.path}" format="text" dataSource="ds-file">

                        <field column="text"/>
                    </entity>
                </entity>
            </entity>
        </document>
    </dataConfig>
    

    Two data sources of different types work together through nested entities. The database data source (ds-db) supplies the metadata and the file path, and the file data source (ds-file) reads the file at that path so that TikaEntityProcessor can extract its text into the same document.
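
    To store the extracted text under the field name used in the question, the innermost entity can rename the Tika output with the name attribute. This is only a sketch: it assumes the schema defines a plain_text field (that field is not part of the configuration above), and it relies on TikaEntityProcessor exposing the document body as the text column.

    <entity name="file" onError="skip" processor="TikaEntityProcessor"
            url="${attachment.path}" format="text" dataSource="ds-file">
        <!-- "text" is the column emitted by TikaEntityProcessor; "plain_text" is an assumed schema field -->
        <field column="text" name="plain_text"/>
    </entity>

    Because one supplier can have several attachments, the target field will likely need to be multiValued in the schema so that the text of every file fits into the same document.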