
Spark Streaming - Refresh Static Data

I have a Spark Streaming job that, on startup, queries Hive and builds a Map[Int, String] object, which is then used in parts of the calculations the job performs.

The problem is that the data in Hive can change every 2 hours. I would like to be able to refresh the static data on a schedule, without having to restart the Spark job every time.

The initial load of the Map object takes around 1 minute.
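To make the requirement concrete, here is a minimal sketch of the kind of time-based refresh I have in mind. It is plain Scala with no Spark dependency; `RefreshableMap`, `load`, `maxAgeMs`, and `clock` are hypothetical names used only for illustration, and `load` stands in for whatever code queries Hive.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical holder: reloads the lookup map once it is older than maxAgeMs.
// `load` is a placeholder for the Hive query; `clock` is injectable for testing.
class RefreshableMap(load: () => Map[Int, String],
                     maxAgeMs: Long,
                     clock: () => Long = () => System.currentTimeMillis()) {

  // The load time and the map are stored together so they are always swapped as a pair.
  private val state =
    new AtomicReference[(Long, Map[Int, String])]((clock(), load()))

  // Returns the cached map, reloading it first if it has expired.
  def get: Map[Int, String] = {
    val (loadedAt, current) = state.get()
    if (clock() - loadedAt < maxAgeMs) current
    else {
      val fresh = load() // blocks the caller for as long as the load takes
      state.set((clock(), fresh))
      fresh
    }
  }
}
```

Assuming the holder lives on the driver, `get` could be called from inside `transform` or `foreachRDD`, whose function bodies run on the driver once per batch. Because the reload blocks for about a minute, the batch that triggers it would be delayed by that long.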

Any help is very welcome.


  • You can use a listener, which is triggered every time a job is started for any stream within the Spark context. Since your database is updated only every two hours, there is no harm in reloading it every time, AFAIK.

    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

    sc.addSparkListener(new SparkListener() {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
        // load the data into the map that will be sent to the executors