I have a topology with a spout that emits a tuple on the status stream, which is picked up by the StatusUpdaterBolt, which in turn writes the data to an Elasticsearch index.
The spout emits a tuple with a Metadata object that contains custom metadata (e.g. a "crawler" key).
This custom metadata is not being written to the status index.
The config looks something like this:
spouts:
  - id: "myspout"
    className: com.mycompany.MySpout
    parallelism: 8

bolts:
  - id: "status"
    className: com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt
    parallelism: 4

streams:
  - from: "myspout"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
The Metadata object is built like this:
Metadata metadata = new Metadata();
...
metadata.setValue("crawler", "mycrawl");
and then is emitted:
collector.emit(new Values(url, metadata));
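For what it's worth, the StatusUpdaterBolt consumes tuples on the status stream carrying a URL, a Metadata object, and a Status value, so a spout emit usually names the stream explicitly and includes the status as a third field. A sketch of what that might look like (class names from the com.digitalpebble.stormcrawler packages; the "mycrawl" value is a placeholder):

```
import org.apache.storm.tuple.Values;

import com.digitalpebble.stormcrawler.Constants;
import com.digitalpebble.stormcrawler.Metadata;
import com.digitalpebble.stormcrawler.persistence.Status;

// Inside the spout: emit on the status stream (Constants.StatusStreamName,
// i.e. "status") with an explicit Status as the third tuple field.
Metadata metadata = new Metadata();
metadata.setValue("crawler", "mycrawl");
collector.emit(Constants.StatusStreamName,
        new Values(url, metadata, Status.DISCOVERED));
```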
Why would the custom properties not get written to the status index?
Versions: Storm 2.4.0, StormCrawler 2.8
As per the documentation here: https://github.com/DigitalPebble/storm-crawler/wiki/MetadataTransfer
you need to specify which metadata fields should be transferred/persisted into the status index; any field not listed won't be persisted.
In your example, add this to the crawler configuration:
metadata.persist:
  - crawler
Note: if you were using parse filters to extract outlinks, you'd also need to include
metadata.transfer:
  - crawler
if you wanted the metadata carried over to the new documents generated from those outlinks.
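Putting it together, the relevant part of the crawler configuration (e.g. crawler-conf.yaml, or the config: section of the Flux file) might look like this sketch, assuming "crawler" is the only custom key:

```
config:
  # persist the custom key into the status index
  metadata.persist:
    - crawler
  # only needed if outlinks discovered at parse time should inherit the key
  metadata.transfer:
    - crawler
```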