I intend to use Solr's Data Import Handler to create documents from RDBMS records. One of the RDBMS columns holds the path to a PDF or Word file. I would like to parse that file with Tika and store the extracted text in another field of the same document, so that each final document combines the database columns and the Tika output.
For example:
Document fields from db: author, publish_year, e-mail
Document fields from tika: plain_text
Is this possible with a single document configuration in the Data Import Handler, or should I run separate imports (SQL and Tika as separate document types) and then join them at query time?
Yes, it is. After some trial and error, the following configuration works:
<dataConfig>
  <!-- JDBC source for the database rows (the default data source type is JdbcDataSource) -->
  <dataSource name="ds-db" driver="org.mariadb.jdbc.Driver" url="jdbc:mysql://localhost:3306/eepyakm?user=root" user="root" password="root"/>
  <!-- Binary file source that hands file contents to Tika -->
  <dataSource name="ds-file" type="BinFileDataSource"/>
  <document>
    <!-- Root entity: one Solr document per supplier row -->
    <entity name="supplier" query="select * from suppliers_tmp_view" dataSource="ds-db"
            deltaQuery="select id from suppliers_tmp_view where last_modified > '${dataimporter.last_index_time}'"
            deltaImportQuery="select * from suppliers_tmp_view where id='${dataimporter.delta.id}'">
      <!-- Child entity: the file rows that belong to the current supplier -->
      <entity name="attachment" dataSource="ds-db"
              query="select * from suppliers_tmp_files_view where supplier_tmp_id='${supplier.id}' and path is not null"
              deltaQuery="select id,supplier_tmp_id from suppliers_tmp_files_view where last_modified > '${dataimporter.last_index_time}' and path is not null"
              parentDeltaQuery="select id from suppliers_tmp_view where id='${attachment.supplier_tmp_id}'">
        <field name="path" column="path"/>
        <!-- Innermost entity: Tika parses the file found at the path read from the database -->
        <entity name="file" onError="skip" processor="TikaEntityProcessor" url="${attachment.path}" format="text" dataSource="ds-file">
          <field column="text"/>
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
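This works because two data sources of different types cooperate through nested entities: the database data source supplies the file path, and the file data source (BinFileDataSource) passes that file's contents to the TikaEntityProcessor. The extracted text ends up in the same document as the columns of the parent database row.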
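If the extracted text should land in a field called plain_text, as in the question, the innermost entity could probably map Tika's text column to that name. This is only a sketch: plain_text is the name taken from the question rather than from the configuration I tested, and it assumes a matching field is already defined in the Solr schema.

```xml
<!-- Sketch only: plain_text is the field name from the question, not part of the tested config. -->
<entity name="file" onError="skip" processor="TikaEntityProcessor"
        url="${attachment.path}" format="text" dataSource="ds-file">
  <!-- column is what Tika emits ("text"); name is the target Solr field (assumed to exist in the schema) -->
  <field column="text" name="plain_text"/>
</entity>
```

Without the name attribute, as in the working configuration, the value is written to a Solr field named after the column, that is, text.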