Until now, I have only used Spark on a Hadoop cluster with YARN as the resource manager. In that type of cluster, I know exactly how many executors to run and how the resource management works. However, know that I am trying to use a Standalone Spark Cluster, I have got a little bit confused. Correct me where I am wrong.
From this article, by default, a worker node uses all the memory of the node minus 1 GB. But I understand that by using SPARK_WORKER_MEMORY
, we can use lesser memory. For example, if the total memory of the node is 32 GB, but I specify 16 GB, Spark worker is not going to use anymore than 16 GB on that node?
But what about executors? Let us say if I want to run 2 executors per node, can I do that by specifying executor memory during spark-submit
to be half of SPARK_WORKER_MEMORY
, and if I want to run 4 executors per node, by specifying executor memory to be the quarter of SPARK_WORKER_MEMORY
?
If so, besides executor memory, I would also have to specify executor cores correctly, I think. For example, if I want to run 4 executors on a worker, I would have to specify executor cores to be the quarter of SPARK_WORKER_CORES
? What happens, if I specify a bigger number than that? I mean if I specify executor memory to be the quarter of SPARK_WORKER_MEMORY
, but executor cores to be only half of SPARK_WORKER_CORES
? Would I get 2 or 4 executors running on that node in that case?
So, I experimented with the Spark Standalone cluster myself a bit, and this is what I noticed.
My intuition that muliple executors can be run inside a worker, by tuning executor cores was indeed correct. Let us say, your worker has 16 cores. Now if you specify 8 cores for executors, Spark would run 2 executors per worker.
How many executors run inside a worker also depend upon the executor memory you specify. For example, if worker memory is 24 GB, and you want to run 2 executors per worker, you cannot specify executor memory to be more than 12 GB.
A worker's memory can be limited when starting a slave by specifing the value for optional parameter--memory
or by changing the value of SPARK_WORKER_MEMORY
. Same with the number of cores (--cores
/SPARK_WORKER_CORES
).
If you want to be able to run multiple jobs on the Standalone Spark cluster, you could use the spark.cores.max
configuration property while doing spark-submit
. For example, like this.
spark-submit <other parameters> --conf="spark.cores.max=16" <other parameters>
So, if your Standalone Spark Cluster allows 64 cores in total, and you give only 16 cores to your program, other Spark jobs could use the remaining 48 cores.