I have a build running with the DRIVER_MEMORY_LARGE, NUM_EXECUTORS_64, and EXECUTOR_CORES_LARGE profiles. Why aren't these resources enough to prevent my job from failing with executor loss due to OOM? I know I have some large tasks in my job, but it seems strange that these resources aren't enough for it to succeed.
It’s important to understand your job’s resource allocation not just at the executor level but at the task level.
As a refresher, many tasks can run inside each executor; typically a task gets one core of the executor to do its work. This ratio can be tuned, but doing so is not advisable for most jobs.
In a typical setup, each executor gets two cores, so it can run two tasks at a time.
Executors also get a standard amount of memory, and this memory is fixed when you start your job. Your memory therefore doesn’t scale up when you change your core count unless you specifically request it via a profile like EXECUTOR_MEMORY_LARGE.
Each task therefore normally gets M / C memory to run with, where M is the executor’s memory and C is the number of cores in the executor.
When you increase C, you decrease the amount of memory per task, which increases your risk of OOM.
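As a quick illustration of that M / C relationship, here is a back-of-the-envelope sketch (the memory and core counts below are hypothetical examples, not the actual values behind any profile):

```python
def per_task_memory_gb(executor_memory_gb: float, executor_cores: int) -> float:
    """Approximate memory each task gets: M / C."""
    return executor_memory_gb / executor_cores

# Hypothetical executor with 8 GB of memory and the default 2 cores:
default_share = per_task_memory_gb(8, 2)   # 4 GB per task

# Boosting to 8 cores without adding memory shrinks each task's share:
boosted_share = per_task_memory_gb(8, 8)   # 1 GB per task
```

With the core boost, each task has a quarter of the memory it had before, even though the executor itself is unchanged, which is exactly why OOM risk goes up.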
Therefore, if you remove your core-count boost (EXECUTOR_CORES_LARGE), your job is more likely to succeed.
In most cases, you don’t want to modify the core count per executor, only the memory allocated to each executor and the overall executor count. These two knobs let you size your infrastructure to succeed on large tasks (executor memory boost) and to run fast enough to meet your schedule requirements (executor count).
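To see why those two knobs are independent, here is a rough sizing sketch (all numbers hypothetical): executor memory controls per-task headroom, while executor count controls how many task slots run in parallel.

```python
def cluster_shape(num_executors: int, executor_memory_gb: float,
                  executor_cores: int = 2) -> dict:
    """Rough cluster sizing: per-task memory governs whether large
    tasks survive; total task slots governs throughput."""
    return {
        "per_task_memory_gb": executor_memory_gb / executor_cores,
        "total_task_slots": num_executors * executor_cores,
    }

# Boost executor memory to survive large tasks (slots unchanged):
big_tasks = cluster_shape(num_executors=64, executor_memory_gb=16)

# Add executors to run faster (per-task memory unchanged):
faster = cluster_shape(num_executors=128, executor_memory_gb=16)
```

Doubling executor memory doubles each task's headroom without touching parallelism; doubling the executor count doubles parallelism without touching headroom, so the two can be tuned independently.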
Cheers