web-crawler, apache-storm, stormcrawler

Propagate custom metadata from seed URLs to all discovered child URLs at every depth


I have a StormCrawler-based project which indexes all content and status in Solr collections. For each seed URL, I have some metadata that needs to be passed on to every child URL discovered from that seed. For example, I have a data structure similar to this:

<crawlId, seedUrl, myMetadata>

How can I pass the crawlId and the corresponding myMetadata to all discovered children of each seedUrl? Is there a built-in capability for this?


Solution

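  • metadata.transfer is what you need: the metadata keys listed under it are passed from a URL to the outlinks discovered from it, so values set on a seed follow its children at every depth. See the configuration file generated by the archetype.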
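
As a rough sketch, assuming the metadata keys are named crawlId and myMetadata (names borrowed from the question's data structure, not defined by StormCrawler), the relevant part of the crawler configuration file could look something like this:

```yaml
config:
  # Keys listed here are copied from a parent URL to its outlinks.
  # "crawlId" and "myMetadata" are illustrative names from the question.
  metadata.transfer:
   - crawlId
   - myMetadata
```

The values themselves still have to be attached to each seed when it is injected. With a file-based seed list this is typically done by adding tab-separated key=value pairs after the URL on the same line, though the exact mechanism depends on how the seeds are fed into the topology.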