Tags: java, rest, apache-spark, spring-data-hadoop

How to submit multiple Spark applications in parallel without spawning separate JVMs?


The problem is that you need to launch a separate JVM to create a separate session with a different amount of RAM per job.

How can I submit a few Spark applications simultaneously without manually spawning separate JVMs?

My app runs on a single server, within a single JVM. That appears to be a problem given Spark's session-per-JVM paradigm, which says:

1 JVM => 1 app => 1 session => 1 context => 1 RAM/executors/cores config

I'd like to have different configurations per Spark application without manually launching extra JVMs. The configurations I want to vary (see the sketch after this list):

  1. spark.executor.cores
  2. spark.executor.memory
  3. spark.dynamicAllocation.maxExecutors
  4. spark.default.parallelism
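
Here is a minimal sketch of why these are fixed per JVM, assuming the Spark 2.x SparkSession API: settings like executor memory are read once when the session's context is created, and a second getOrCreate() in the same JVM returns the existing session, so new values for static configs never take effect.

    import org.apache.spark.sql.SparkSession;

    public class SessionConfigDemo {
        public static void main(String[] args) {
            // First session in this JVM: executor sizing is fixed from here on.
            SparkSession big = SparkSession.builder()
                    .appName("long-running-job")
                    .config("spark.executor.memory", "28g")
                    .config("spark.executor.cores", "2")
                    .getOrCreate();

            // Attempt to create a differently-sized session in the same JVM:
            // getOrCreate() returns the EXISTING session, and static configs
            // such as spark.executor.memory do not take effect.
            SparkSession tiny = SparkSession.builder()
                    .config("spark.executor.memory", "1g")
                    .getOrCreate();

            System.out.println(big == tiny); // true: one JVM, one session, one config
        }
    }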

Use case

You have started a long-running job, say one that takes 4-5 hours to complete. The job runs within a session configured with spark.executor.memory=28GB and spark.executor.cores=2. Now you want to launch a 5-10 second job on user demand, without waiting 4-5 hours. This tiny job needs 1 GB of RAM. What would you do? Submit the tiny job from the long-running job's session? Then it will claim 28 GB.

What I've found

  1. Spark allows you to configure the number of cores and executors only at the session level. Spark scheduler pools let you slice and dice only the number of cores, not RAM or executors, right?
  2. Spark Job Server. But it doesn't support Spark newer than 2.0, so it's not an option for me; it does, however, actually solve the problem for versions older than 2.0. Among the listed Spark JobServer features is "Separate JVM per SparkContext for isolation (EXPERIMENTAL)", which means it spawns a new JVM per context.
  3. Mesos fine-grained mode is deprecated.
  4. This hack, but it's too risky to use in production.
  5. The hidden Apache Spark REST API for job submission (read this and this). There is definitely a way to specify executor memory and cores there, but what is the behavior when submitting two jobs with different configs? As I understand it, this is a Java REST client for it (a sketch of such a submission follows this list).
  6. Livy. I'm not familiar with it, but it looks like they have a Java API only for batch submission, which is not an option for me.
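
For illustration, here is roughly what a submission to that hidden REST API can look like in Java, assuming a standalone master whose REST endpoint listens on port 6066; the host, jar path, and class name are hypothetical, and the payload fields should be double-checked against your Spark version. Per-job executor sizing goes into sparkProperties, so each submission gets its own driver and its own executor set:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RestSubmitDemo {
        public static void main(String[] args) throws Exception {
            // Hypothetical master host; the standalone REST server listens on 6066.
            String url = "http://spark-master:6066/v1/submissions/create";

            // Hypothetical jar/class; executor sizing is set per submission.
            String payload =
                "{\n" +
                "  \"action\": \"CreateSubmissionRequest\",\n" +
                "  \"appResource\": \"hdfs:///jobs/tiny-job.jar\",\n" +
                "  \"mainClass\": \"com.example.TinyJob\",\n" +
                "  \"clientSparkVersion\": \"2.3.0\",\n" +
                "  \"appArgs\": [],\n" +
                "  \"environmentVariables\": { \"SPARK_ENV_LOADED\": \"1\" },\n" +
                "  \"sparkProperties\": {\n" +
                "    \"spark.app.name\": \"tiny-job\",\n" +
                "    \"spark.master\": \"spark://spark-master:7077\",\n" +
                "    \"spark.submit.deployMode\": \"cluster\",\n" +
                "    \"spark.jars\": \"hdfs:///jobs/tiny-job.jar\",\n" +
                "    \"spark.executor.memory\": \"1g\",\n" +
                "    \"spark.executor.cores\": \"1\"\n" +
                "  }\n" +
                "}";

            HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(payload))
                    .build(),
                HttpResponse.BodyHandlers.ofString());

            System.out.println(response.body()); // contains a submissionId on success
        }
    }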

Solution

  • With a use case, this is much clearer now. There are two possible solutions:

    If you require shared data between those jobs, use the FAIR scheduler and a (REST) frontend (as SparkJobServer, Livy, etc. do). You don't need SparkJobServer either; it should be relatively easy to code if you have a fixed scope, and I've seen projects go in that direction. All you need is an event loop and a way to translate your incoming queries into Spark queries. In a way, I would expect there to be demand for a library covering this use case, since it's pretty much always the first thing you have to build when you work on a Spark-based application/framework. In this case, you can size your executors according to your hardware, and Spark will manage the scheduling of your jobs. With YARN's dynamic resource allocation, YARN will also free resources (kill executors) when your framework/app is idle. For more information, read here: http://spark.apache.org/docs/latest/job-scheduling.html
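
    A minimal sketch of that shared-session approach, assuming a single SparkSession with spark.scheduler.mode=FAIR; the pool names are hypothetical, and their weights would come from a fairscheduler.xml. Note the limitation from the question still applies: pools divide cores between concurrent jobs, but every job runs on executors of the same size.

        import org.apache.spark.sql.SparkSession;

        public class PoolDemo {
            public static void main(String[] args) throws InterruptedException {
                SparkSession spark = SparkSession.builder()
                        .appName("pool-demo")
                        .config("spark.scheduler.mode", "FAIR")
                        .getOrCreate();

                // The scheduler pool is a thread-local property, so each
                // concurrent job is tagged from its own thread.
                Thread longJob = new Thread(() -> {
                    spark.sparkContext().setLocalProperty("spark.scheduler.pool", "batch");
                    spark.range(0, 10_000_000_000L).selectExpr("sum(id)").show();
                });
                Thread tinyJob = new Thread(() -> {
                    spark.sparkContext().setLocalProperty("spark.scheduler.pool", "interactive");
                    System.out.println(spark.range(0, 1000L).count());
                });

                longJob.start();
                tinyJob.start();   // runs concurrently instead of queueing behind longJob
                longJob.join();
                tinyJob.join();
                spark.stop();
            }
        }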

    If you don't need shared data, use YARN (or another resource manager) to assign resources to both jobs in a fair manner. YARN has a fair scheduling mode, and you can set the resource demands per application. If you think this suits you but you need shared data, then you might want to think about using Hive or Alluxio to provide a data interface. In this case you would run two spark-submits and maintain multiple drivers in the cluster. Building additional automation around spark-submit can help make this less annoying and more transparent to end users. This approach also has higher latency, since resource allocation and SparkSession initialization take a more or less constant amount of time.
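
    If you go the spark-submit route, the automation doesn't have to shell out by hand: Spark ships a launcher API (org.apache.spark.launcher.SparkLauncher) that starts spark-submit as a child process and hands back a handle. A minimal sketch, with a hypothetical jar path and class name; note each launch still produces its own driver, which is exactly the latency trade-off described above:

        import org.apache.spark.launcher.SparkAppHandle;
        import org.apache.spark.launcher.SparkLauncher;

        public class LauncherDemo {
            public static void main(String[] args) throws Exception {
                // Starts spark-submit as a child process; with deploy-mode
                // cluster the driver runs on the cluster, and this JVM only
                // tracks the application's state.
                SparkAppHandle handle = new SparkLauncher()
                        .setMaster("yarn")
                        .setDeployMode("cluster")
                        .setAppResource("hdfs:///jobs/tiny-job.jar") // hypothetical jar
                        .setMainClass("com.example.TinyJob")         // hypothetical class
                        .setConf(SparkLauncher.EXECUTOR_MEMORY, "1g")
                        .setConf(SparkLauncher.EXECUTOR_CORES, "1")
                        .startApplication();

                // Poll until the application reaches a terminal state.
                while (!handle.getState().isFinal()) {
                    Thread.sleep(1000);
                }
                System.out.println("Final state: " + handle.getState());
            }
        }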