java apache-spark apache-spark-standalone

Spark: Job restart and retries

Suppose you have Spark + Standalone cluster manager. You opened spark session with some configs and want to launch SomeSparkJob 40 times in parallel with different arguments.

Questions

How to set reties amount on job failures?
How to restart jobs programmatically on failure? This could be useful if jobs failure due lack of resources. Than I can launch one by one all jobs that require extra resources.
How to restart spark application on job failure? This could be useful if job lack resources even when it's launched simultaneously. Than to change cores, CPU etc configs I need to relaunch application in Standalone cluster manager.

My workarounds

1) I pretty sure the 1st point is possible, since it's possible at spark local mode. I just don't know how to do that in standalone mode.
2-3) It's possible to hand listener on spark context like spark.sparkContext().addSparkListener(new SparkListener() {. But seems SparkListener lacks failure callbacks.

Also there is a bunch of methods with very poor documentation. I've never used them, but perhaps they could help to solve my problem.

spark.sparkContext().dagScheduler().runJob();
spark.sparkContext().runJob()
spark.sparkContext().submitJob()
spark.sparkContext().taskScheduler().submitTasks();
spark.sparkContext().dagScheduler().handleJobCancellation();
spark.sparkContext().statusTracker()

Solution

You can use SparkLauncher and control the flow.

import org.apache.spark.launcher.SparkLauncher;

   public class MyLauncher {
     public static void main(String[] args) throws Exception {
       Process spark = new SparkLauncher()
         .setAppResource("/my/app.jar")
         .setMainClass("my.spark.app.Main")
         .setMaster("local")
         .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
         .launch();
       spark.waitFor();
     }
   }

See API for more details.

Since it creates process you can check the Process status and retry e.g. try following:

public boolean isAlive()

If Process is not live start again, see API for more details.

Hoping this gives high level idea of how we can achieve what you mentioned in your question. There could be more ways to do same thing but thought to share this approach.

Cheers !