apache-spark · amazon-s3 · hdfs

Spark transactional write operation using temporary directories


According to this blog at Databricks, Spark relies on commit protocol classes from Hadoop, so if a job does not finish because of some failure, the output directory is not changed (partial output files do not occur).
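For illustration, a plain write such as the one below goes through that Hadoop committer; the path is made up, and the `_temporary` layout described in the comments is my understanding of what the committer does.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: a plain DataFrame write goes through Hadoop's
// FileOutputCommitter. Task output first lands under
// <dest>/_temporary/... and is only moved into <dest> during
// task/job commit; a _SUCCESS marker is written at the end.
val spark = SparkSession.builder()
  .appName("commit-protocol-example")
  .getOrCreate()

val df = spark.range(0, 1000).toDF("id")

// While this runs, the destination holds only the _temporary subtree;
// the final part files appear at job commit.
df.write.mode("overwrite").parquet("hdfs:///data/output/ids")
```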

So my questions are:

Does Spark prevent partial writes to different storage systems (HDFS, S3, etc.) in case of failures?

Is it possible for different Spark jobs to use the same temporary location before the final write operation?

Is it possible for the same Spark job, submitted more than once, to use the same temporary location?


Solution

  • This is a really interesting problem, and one that is fundamental to how you implement data analytics at a scale where failures are inevitable.

    Hadoop V1 commit algorithm

    HDFS: O(1) to commit a task, and resilient to failure during task commit. Job commit is ~O(files) when there are lots of files; if it fails partway through, the state of the output is unknown.

    S3: O(data) to commit a task, and very slow to commit a job (O(data) for the whole job's output). The lack of an atomic rename is potentially dangerous.
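    A rough sketch of the v1 idea, using the Hadoop FileSystem API directly; this is not the actual FileOutputCommitter source, and the path layout in the comments is simplified:

    ```scala
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Simplified v1 sketch. Rough layout:
    //   task attempt writes:  <dest>/_temporary/<jobAttempt>/_temporary/<taskAttempt>/part-*
    //   committed task dir:   <dest>/_temporary/<jobAttempt>/<task>
    object V1CommitSketch {
      // Task commit: one directory rename -- O(1) and atomic on HDFS,
      // O(data) on S3 because "rename" there is copy + delete.
      def commitTask(fs: FileSystem, taskAttemptDir: Path, committedTaskDir: Path): Unit = {
        fs.rename(taskAttemptDir, committedTaskDir)
      }

      // Job commit: promote every committed task's files -- O(files) on HDFS,
      // O(data) on S3. A crash here leaves the output in an unknown state.
      def commitJob(fs: FileSystem, jobAttemptDir: Path, dest: Path): Unit = {
        val files = fs.listFiles(jobAttemptDir, true)
        while (files.hasNext) {
          val file = files.next().getPath
          fs.rename(file, new Path(dest, file.getName))
        }
        fs.create(new Path(dest, "_SUCCESS")).close() // success marker
      }
    }
    ```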

    Hadoop V2 commit algorithm

    HDFS: O(files) to commit a task, and it cannot handle failure during task commit. Job commit is an O(1) touch of the _SUCCESS marker.

    S3: O(data) to commit a task, it cannot handle failure either, and because the COPY operation used to commit takes longer, the chance of a task-commit failure is higher.
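    For reference, the algorithm version is normally selected with the standard Hadoop property below; a minimal Spark setup might look like this (the default value depends on the Hadoop/Spark versions in use):

    ```scala
    import org.apache.spark.sql.SparkSession

    // mapreduce.fileoutputcommitter.algorithm.version selects v1 or v2;
    // Spark forwards any spark.hadoop.* setting into the Hadoop Configuration.
    val spark = SparkSession.builder()
      .appName("committer-version")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
      .getOrCreate()
    ```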

    I don't personally think the failure semantics of the V2 algorithm work; both MapReduce and Spark assume that a task which fails during the commit process can be repeated, and that assumption does not hold here.

    There are some extra details which you don't want to know about, like how the drivers conclude that a task has failed, or how MapReduce decides that it has been partitioned from YARN and so must not commit, but generally it all comes down to heartbeats and the assumption that once a task has timed out it is not going to resurface. If you are implementing a commit algorithm yourself, make sure that a task committer which has hung until after the entire job has committed will not affect the output.
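    One hedged sketch of that rule for a home-grown protocol: each task attempt writes only under its own attempt directory, and job commit promotes only the attempt IDs the driver recorded as successful, so a committer that wakes up after job commit can only touch a directory that will never be promoted. The names and layout below are assumptions for illustration:

    ```scala
    import org.apache.hadoop.fs.{FileSystem, Path}

    object SafeCommitSketch {
      // Every attempt gets a private directory; a late or zombie committer
      // can only ever write here.
      def taskAttemptDir(jobDir: Path, taskId: Int, attempt: Int): Path =
        new Path(jobDir, s"_attempts/task-$taskId-attempt-$attempt")

      // Job commit promotes only the attempts the driver recorded as
      // successful, then removes everything else, so stray attempts
      // cannot affect the final output.
      def commitJob(fs: FileSystem, jobDir: Path, dest: Path,
                    successfulAttempts: Seq[(Int, Int)]): Unit = {
        for ((taskId, attempt) <- successfulAttempts) {
          fs.rename(taskAttemptDir(jobDir, taskId, attempt), new Path(dest, s"task-$taskId"))
        }
        fs.delete(new Path(jobDir, "_attempts"), true)
        fs.create(new Path(dest, "_SUCCESS")).close()
      }
    }
    ```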

    For object stores:

    The ongoing work is on manifest-only table formats: Apache Iceberg is the most popular, Apache Hudi is Uber's, and Delta Lake is Databricks' equivalent.
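    As a hedged example of what that looks like from Spark, with the Delta Lake connector on the classpath a write commits atomically to the table's transaction log instead of relying on directory renames; the bucket and path below are made up:

    ```scala
    import org.apache.spark.sql.SparkSession

    // Requires the delta-spark package (and the S3A connector for the
    // bucket below); the commit is an append to the table's _delta_log,
    // not a directory rename, which is what makes it safe on S3.
    val spark = SparkSession.builder()
      .appName("delta-write-example")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
              "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    spark.range(0, 1000).toDF("id")
      .write
      .format("delta")
      .mode("append")
      .save("s3a://my-bucket/tables/ids")
    ```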