
How can I make Solr follow links while parsing a "Solr XML" file to index the results?


I have a web-accessible file system containing thousands of PDF files that I need Solr (with Lucidworks) to index.

I also have an XML file with an entry for each PDF. Each entry contains the ID, some simple metadata, and the URL of the corresponding PDF in the file system.

Currently, I am able to format the XML in such a way that Solr reads it and indexes all the metadata I need, including the URL of the PDF.

I would like Solr to, as it's parsing the files, actually follow the URL and index the referenced PDF data along with the XML-supplied metadata. Is this possible?


Solution

  • Your best bet (on pure Solr) would probably be a DataImportHandler with nested entities.

    The outer entity would use XPathEntityProcessor, and within it you can nest an entity that uses TikaEntityProcessor with an appropriate (binary) data source. Use variables resolved from the outer entity to construct and pass the URL to the inner entity.

    Remember to mark the outer (XPath) entity with rootEntity="false" to ensure that Solr documents are created for the inner entities.
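
A minimal sketch of what such a DataImportHandler configuration might look like, assuming a Solr version that still ships DataImportHandler with the extras jar that provides TikaEntityProcessor on the classpath. The metadata URL, the `/records/record` element layout, the data source and entity names, and the Solr field names (`id`, `title`, `pdf_url`, `content`) are all placeholders that are not from the question; adjust them to your XML and schema.

```xml
<dataConfig>
  <!-- Character data source used to read the metadata XML (URL below is hypothetical) -->
  <dataSource name="meta" type="URLDataSource" encoding="UTF-8"/>
  <!-- Binary data source used by Tika to fetch each PDF over HTTP -->
  <dataSource name="pdf" type="BinURLDataSource"/>

  <document>
    <!-- Outer entity: one row per record in the metadata XML.
         rootEntity="false" makes the nested entity the document root. -->
    <entity name="rec"
            processor="XPathEntityProcessor"
            dataSource="meta"
            url="http://example.com/metadata.xml"
            forEach="/records/record"
            rootEntity="false">
      <!-- Assumed element names; replace with the ones in your XML -->
      <field column="id"      xpath="/records/record/id"/>
      <field column="title"   xpath="/records/record/title"/>
      <field column="pdf_url" xpath="/records/record/url"/>

      <!-- Inner entity: follows the URL from the outer row and extracts the PDF text -->
      <entity name="pdf"
              processor="TikaEntityProcessor"
              dataSource="pdf"
              url="${rec.pdf_url}"
              format="text"
              onError="skip">
        <!-- "text" is the column Tika produces for the extracted body;
             "content" is an assumed field in your schema -->
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Here `${rec.pdf_url}` is the variable that passes the URL from the outer entity to the inner one, and `onError="skip"` is one way to keep a single unreachable or unparseable PDF from aborting the whole import.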