Tags: apache-spark, rdd, exactly-once

Does RDD recomputation on task failure cause duplicate data processing?


When a task fails and causes an RDD to be recomputed from its lineage (for example, by reading the input file again), how does Spark ensure that data is not processed twice? What if the failed task had already written half of its data to an output such as HDFS or Kafka? Will that part of the data be written again? Is this related to exactly-once processing?


Solution

  • Output operations have at-least-once semantics by default. A function passed to foreachRDD can execute more than once if a worker fails, writing the same data to external storage multiple times. There are two approaches to this problem: idempotent updates and transactional updates. Both are discussed in the article linked below, and sketched after the link.

    Further reading

    http://shzhangji.com/blog/2017/07/31/how-to-achieve-exactly-once-semantics-in-spark-streaming/
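A minimal sketch of the idempotent approach, assuming a Postgres table `word_counts` with a primary key on `word`; the JDBC URL, credentials, and schema are placeholders, not part of the original answer. Because each record is upserted under its natural key, a replayed task rewrites the same rows with the same values, so duplicate executions are harmless:

```scala
import java.sql.DriverManager

import org.apache.spark.streaming.dstream.DStream

// Hypothetical stream of (word, count) pairs.
def saveIdempotently(counts: DStream[(String, Long)]): Unit = {
  counts.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // One connection per partition, opened on the executor.
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://db:5432/app", "user", "pass") // placeholder URL/credentials
      try {
        val stmt = conn.prepareStatement(
          // Upsert keyed on the natural key: re-running the task
          // overwrites the same rows instead of appending duplicates.
          "INSERT INTO word_counts (word, cnt) VALUES (?, ?) " +
          "ON CONFLICT (word) DO UPDATE SET cnt = EXCLUDED.cnt")
        records.foreach { case (word, cnt) =>
          stmt.setString(1, word)
          stmt.setLong(2, cnt)
          stmt.executeUpdate()
        }
      } finally {
        conn.close()
      }
    }
  }
}
```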
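And a sketch of the transactional approach, following the pattern described in the linked article: the batch time and partition id are recorded in the same database transaction as the output data, so a replayed task hits the primary-key constraint and rolls back rather than writing the data twice. The table names (`results`, `batch_commits`) and connection details are again assumptions for illustration:

```scala
import java.sql.DriverManager

import org.apache.spark.TaskContext
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Assumes: results(value TEXT) and
// batch_commits(batch_time BIGINT, partition_id INT,
//               PRIMARY KEY (batch_time, partition_id)).
def saveTransactionally(results: DStream[String]): Unit = {
  results.foreachRDD { (rdd, batchTime: Time) =>
    rdd.foreachPartition { records =>
      val partitionId = TaskContext.get().partitionId()
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://db:5432/app", "user", "pass") // placeholder URL/credentials
      try {
        conn.setAutoCommit(false)
        // Inserting (batchTime, partitionId) violates the primary key if
        // this partition of this batch has already committed, so a replayed
        // task aborts here before any duplicate data becomes visible.
        val mark = conn.prepareStatement(
          "INSERT INTO batch_commits (batch_time, partition_id) VALUES (?, ?)")
        mark.setLong(1, batchTime.milliseconds)
        mark.setInt(2, partitionId)
        mark.executeUpdate()

        val insert = conn.prepareStatement("INSERT INTO results (value) VALUES (?)")
        records.foreach { value =>
          insert.setString(1, value)
          insert.executeUpdate()
        }
        conn.commit() // data and commit marker become visible atomically
      } catch {
        case e: java.sql.SQLException if e.getSQLState == "23505" =>
          // Unique-key violation: this batch/partition was already written.
          conn.rollback()
      } finally {
        conn.close()
      }
    }
  }
}
```

The trade-off between the two: idempotent updates require a natural key in the data, while transactional updates work for arbitrary output but need a storage system that supports atomic transactions.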