stormcrawler

Problem running example topology with storm-crawler 2.3-SNAPSHOT


I'm building SC 2.3-SNAPSHOT from source and generating a project from the archetype, then trying to run the example Flux topology. The seeds are injected properly: I can see all of them in the ES index with the status DISCOVERED. The problem is that no fetching seems to happen after the injection, so I'm looking for ideas on what to investigate. All the Storm components look fine, and so does ES. In the logs for my single worker, I see errors like this:

2022-02-28 08:41:48.852 c.d.s.e.p.AggregationSpout I/O dispatcher 13 [ERROR] [spout #2]  Exception with ES query
java.io.IOException: Unable to parse response body for Response{requestLine=POST /status/_search?typed_keys=true&max_concurrent_shard_requests=5&search_type=query_then_fetch&batched_reduce_size=512&preference=_shards%3A2%7C_local HTTP/1.1, host=http://node-1:9200, response=HTTP/1.1 200 OK}
        at org.elasticsearch.client.RestHighLevelClient$1.onSuccess(RestHighLevelClient.java:2351) [stormjar.jar:?]
        at org.elasticsearch.client.RestClient$FailureTrackingResponseListener.onSuccess(RestClient.java:660) [stormjar.jar:?]
        at org.elasticsearch.client.RestClient$1.completed(RestClient.java:394) [stormjar.jar:?]
        at org.elasticsearch.client.RestClient$1.completed(RestClient.java:388) [stormjar.jar:?]
        at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:122) [stormjar.jar:?]
        at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:181) [stormjar.jar:?]
        at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:448) [stormjar.jar:?]
        at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:338) [stormjar.jar:?]
        at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265) [stormjar.jar:?]
        at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81) [stormjar.jar:?]
        at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39) [stormjar.jar:?]
        at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114) [stormjar.jar:?]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) [stormjar.jar:?]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) [stormjar.jar:?]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) [stormjar.jar:?]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) [stormjar.jar:?]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) [stormjar.jar:?]
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591) [stormjar.jar:?]
        at java.lang.Thread.run(Thread.java:750) [?:1.8.0_322]
Caused by: java.lang.NullPointerException
        at com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout$InProcessMap.containsKey(AbstractQueryingSpout.java:158) ~[stormjar.jar:?]
        at com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout.onResponse(AggregationSpout.java:252) ~[stormjar.jar:?]
        at com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout.onResponse(AggregationSpout.java:63) [stormjar.jar:?]
        at org.elasticsearch.client.RestHighLevelClient$1.onSuccess(RestHighLevelClient.java:2349) [stormjar.jar:?]
        ... 18 more

Solution

  • This was fixed recently in https://github.com/DigitalPebble/storm-crawler/commit/88784c1af9a35fd45df3b68ace279a0b73e1e856

    Please run git pull and mvn clean install on StormCrawler before rebuilding the topology.

    Regarding

    "WARN o.a.s.u.Utils - Topology crawler contains unreachable components "__system". What does it refer to?"

    No idea, but it shouldn't be a big issue.
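
The rebuild steps mentioned above (git pull, then mvn clean install, then rebuilding the topology) could be scripted roughly as follows. This is only a sketch: the function name `rebuild_stormcrawler` and both directory arguments are hypothetical and depend on your local layout, and it assumes the archetype-generated project is built with `mvn clean package`.

```shell
# Hypothetical helper; the two directory arguments are assumptions about your setup.
rebuild_stormcrawler() {
  sc_src="$1"    # local clone of the storm-crawler repository
  project="$2"   # project generated from the archetype

  # Pull the latest commits (including the fix) and reinstall the SNAPSHOT artifacts locally.
  (cd "$sc_src" && git pull && mvn clean install) || return 1

  # Rebuild the topology jar so it picks up the refreshed SNAPSHOT dependencies.
  (cd "$project" && mvn clean package) || return 1
}
```

It would be called as, for example, `rebuild_stormcrawler ~/storm-crawler ~/my-crawler`, after which the topology can be resubmitted as before.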