apache-storm, apache-storm-topology

How to maintain a distributed HashMap in an Apache Storm cluster


We have a use case in Apache Storm where we get data from a source system and then perform some operation on each tuple that is received, but we also need to look up data in a database. Making a database call for every one of millions of records is not feasible. Is there a way to load a distributed hash map on startup, so that when a tuple is processed in a Bolt or Spout we first look it up in this hash map, and only if the value is not present do we make the database call and update the map, with the map being accessible across the topology?


Solution

  • There is nothing built in (i.e. without running external services) that would be accessible to the entire topology, since your bolts will likely run in different JVMs or even on different hosts. If you need a distributed cache, look at something like Redis https://redis.io/; a cache-aside sketch follows below.

    You might also want to look at the state checkpointing API, https://storm.apache.org/releases/2.0.0-SNAPSHOT/State-checkpointing.html; it should be able to do what you want, and there is Redis integration for the state backend (a second sketch below uses this API). If you don't need the checkpointing functionality, you can of course also just use Redis directly.
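
A minimal cache-aside sketch of the Redis option, assuming the Jedis client; the Redis host, the tuple field names, the one-hour expiry, and the lookupInDatabase helper are placeholders standing in for your own connection settings and database call:

    import java.util.Map;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.JedisPool;

    public class CacheAsideLookupBolt extends BaseRichBolt {

        private transient JedisPool pool;
        private transient OutputCollector collector;

        @Override
        public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            // One connection pool per executor; "redis-host" is a placeholder.
            this.pool = new JedisPool("redis-host", 6379);
        }

        @Override
        public void execute(Tuple tuple) {
            String key = tuple.getStringByField("key");
            String value;
            try (Jedis jedis = pool.getResource()) {
                // 1. Check the shared cache first.
                value = jedis.get(key);
                if (value == null) {
                    // 2. Cache miss: fall back to the database, then write the result
                    //    back so every other executor can reuse it.
                    value = lookupInDatabase(key);
                    if (value != null) {
                        jedis.setex(key, 3600, value); // expire after 1h to limit staleness
                    }
                }
            }
            collector.emit(tuple, new Values(key, value));
            collector.ack(tuple);
        }

        // Placeholder for the actual database lookup.
        private String lookupInDatabase(String key) {
            return null;
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("key", "value"));
        }
    }

Because the cache lives in Redis rather than inside the worker JVMs, every bolt instance sees the same entries, and restarting the topology does not throw the cache away.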
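
If you go through the state checkpointing API instead, a stateful bolt is handed a KeyValueState it can use in the same cache-aside way. This is only a sketch, under the same assumptions about tuple field names and the hypothetical lookupInDatabase helper:

    import java.util.Map;

    import org.apache.storm.state.KeyValueState;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseStatefulBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class StatefulLookupBolt extends BaseStatefulBolt<KeyValueState<String, String>> {

        private KeyValueState<String, String> cache;
        private OutputCollector collector;

        @Override
        public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void initState(KeyValueState<String, String> state) {
            // Called after prepare(); the state is created by the configured
            // state provider and restored after a restart.
            this.cache = state;
        }

        @Override
        public void execute(Tuple tuple) {
            String key = tuple.getStringByField("key");
            String value = cache.get(key);
            if (value == null) {
                value = lookupInDatabase(key); // fall back to the database on a miss
                if (value != null) {
                    cache.put(key, value);
                }
            }
            collector.emit(tuple, new Values(key, value));
            collector.ack(tuple);
        }

        // Placeholder for the actual database lookup.
        private String lookupInDatabase(String key) {
            return null;
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("key", "value"));
        }
    }

Per the linked checkpointing page, the backing store is chosen with the topology.state.provider setting (org.apache.storm.redis.state.RedisKeyValueStateProvider for Redis); the default is an in-memory state that does not survive restarts.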