spark-streaming, fault-tolerance, data-loss, dstream

Spark Streaming fault tolerance on DStream batches


Suppose a stream is received at time X and my batch duration is 1 minute. My executors start processing the first batch, but this processing takes 3 minutes, i.e. until X+3. Meanwhile, at X+1 and X+2, we receive two more batches. Does that mean that at X+1 my first batch is lost? Or is it stored in memory and still being processed?
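
For concreteness, here is roughly the kind of job I have in mind (a minimal sketch; the host/port and the empty processing body are placeholders for my actual source and my ~3-minute work):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SlowBatchExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("SlowBatchExample")
        // Batch duration of 1 minute, as described above.
        val ssc = new StreamingContext(conf, Seconds(60))

        // Any receiver-based stream behaves the same way.
        val lines = ssc.socketTextStream("localhost", 9999)

        lines.foreachRDD { rdd =>
          rdd.foreach { record =>
            // stand-in for the work that makes each batch take ~3 minutes
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }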


Solution

  • No data will be lost during processing in Spark Streaming. Spark stores all incoming data in memory (and also replicates it to other nodes for fault tolerance). After each batch interval, Spark schedules a new job to process the stored data (a micro-batch). While that job is running, newly arriving data keeps being stored in memory for future processing (see the sketch at the end of this answer).

    That said, the scenario in your example cannot work in the long run. As stated in the Spark documentation:

    For a Spark Streaming application running on a cluster to be stable, the system should be able to process data as fast as it is being received. In other words, batches of data should be processed as fast as they are being generated.

    In layman's terms: if it takes you 3 minutes to process a 1-minute slice of data, it isn't going to work in the long term. Sooner or later your app will blow up anyway, because the memory needed to store the incoming data keeps growing.
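
    For illustration, a minimal sketch of a receiver-based setup (the host/port are placeholders, and the backpressure setting is my suggestion rather than anything from your question): explicit replication keeps a second copy of each incoming block on another node, and backpressure throttles the receive rate so unprocessed batches don't pile up in memory.

        import org.apache.spark.SparkConf
        import org.apache.spark.storage.StorageLevel
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        object StableStreaming {
          def main(args: Array[String]): Unit = {
            val conf = new SparkConf()
              .setAppName("StableStreaming")
              // Throttle the receive rate when processing can't keep up,
              // instead of buffering incoming data in memory without bound.
              .set("spark.streaming.backpressure.enabled", "true")

            val ssc = new StreamingContext(conf, Seconds(60))

            // MEMORY_AND_DISK_SER_2 keeps two serialized copies of each
            // incoming block on different nodes (this is also the default
            // storage level for receiver-based streams).
            val lines = ssc.socketTextStream(
              "localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)

            lines.count().print()

            ssc.start()
            ssc.awaitTermination()
          }
        }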