Tags: apache-spark, hadoop-yarn, preemption

How can I increase failure tolerance for Spark jobs on YARN? My job failed due to too many preemptions.


How can I increase failure tolerance on YARN? In a busy cluster my job fails due to too many failures. Most of the failures are reported as executors lost because of preemption.


Solution

  • If you have preemption enabled, you really should be using the external shuffle service. It keeps shuffle files available on the NodeManager after an executor is preempted, so the loss does not force a recompute of its shuffle output. Beyond that, there is not much that can be done.

    The issue is discussed in https://issues.apache.org/jira/browse/SPARK-14209.
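
    A minimal configuration sketch for the advice above: enable the external shuffle service and raise the failure ceilings that a preemption-heavy cluster tends to hit. The numeric values here are illustrative assumptions, not recommendations; tune them to your cluster.

    ```properties
    # spark-defaults.conf (or pass each setting as --conf to spark-submit)

    # Keep shuffle files alive in the NodeManager so a preempted executor's
    # output survives it. Requires the spark_shuffle aux-service to be
    # configured on every NodeManager.
    spark.shuffle.service.enabled        true

    # Raise the failure limits a busy cluster tends to hit.
    # Values are illustrative only.
    spark.task.maxFailures               8
    spark.yarn.max.executor.failures     100
    spark.yarn.maxAppAttempts            4
    ```

    Note that the external shuffle service also needs a one-time setup on the YARN side (adding `spark_shuffle` to `yarn.nodemanager.aux-services` and deploying the shuffle-service jar to the NodeManagers); the Spark-side flag alone is not enough.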