elasticsearchstormcrawler

How can I send StormCrawler content to multiple Elasticsearch indices, based on host?


I currently have a successful StormCrawler instance crawling about 20 sites, and indexing the content to one Elasticsearch index. Is it possible, either in ES or via StormCrawler, to send each host's content to its own unique content index?


Solution

  • Out of curiosity: why do you need to do that? Having one index per host seems rather wasteful. You can filter the results based on a field like host if you want to provide results for a particular host.

    To answer your question, there is no direct way of doing it currently as the IndexerBolt it connected to one index only. You could declare one IndexerBolt per index you need and add a custom bolt to fan based on the value of the host metadata but this is not dynamic and rather heavy-handed. There could be a way of doing it using pipelines in ES, not sure.