My Spark job on Dataproc 2.0 failed. In the driver log, there were many
ExecutorLostFailure (executor 45 exited caused by one of the running tasks)
Reason: Container from a bad node: ... Container killed on request. Exit code is 137
and
23/11/25 10:38:30 ERROR org.apache.spark.network.server.TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:176)
at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:150)
What could be the possible causes? How do I fix it?
The exception org.apache.spark.SparkException: Could not find CoarseGrainedScheduler
usually indicates that a Spark executor failed or was terminated unexpectedly, and the error Container killed on request. Exit code is 137
usually indicates that the executor's YARN container was killed by earlyoom (which is available on Dataproc 2.0+ images).
When the worker node as a whole is under memory pressure, earlyoom is triggered to select and kill processes to release memory and keep the node from becoming unhealthy, and YARN containers are often the processes selected. This can be confirmed in /var/log/earlyoom.log on the worker node
or in Cloud Logging with
resource.type="cloud_dataproc_cluster"
resource.labels.cluster_name=...
resource.labels.cluster_uuid=...
earlyoom
You might see logs like
process is killed due to memory pressure. /usr/lib/jvm/.../java ... org.apache.spark.executor.YarnCoarseGrainedExecutorBackend
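The same check can also be done from the command line with gcloud logging read. This is only a minimal sketch: my-project and my-cluster are placeholders, and you may want to widen the --freshness window to cover the time of the failure.

gcloud logging read \
  'resource.type="cloud_dataproc_cluster" AND resource.labels.cluster_name="my-cluster" AND "earlyoom"' \
  --project=my-project --limit=50 --freshness=1d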
In this case, you need to reduce memory pressure on the node: either lower yarn.nodemanager.resource.memory-mb
so more memory is left for other processes, or use worker nodes with more memory.
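For example, both knobs can be set when (re)creating the cluster with gcloud. This is a rough sketch rather than a prescription: the cluster name, region, machine type and the 12288 MB value are placeholders you should size for your own workload (yarn.nodemanager.resource.memory-mb is set as a cluster property in the yarn: namespace).

# Placeholders: adjust the name, region, machine type and memory value for your workload.
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --image-version=2.0-debian10 \
  --worker-machine-type=n1-highmem-8 \
  --properties='yarn:yarn.nodemanager.resource.memory-mb=12288'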
Note that Container killed on request. Exit code is 137
is usually NOT an indicator that the container itself ran out of memory. If the container itself exceeded its memory limit, there would instead be errors like Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used.
In that case, you might want to consider increasing the Spark executor memory (spark.executor.memory) and/or memory overhead (spark.executor.memoryOverhead).
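A hedged example of the latter, passing job-level properties at submit time (the cluster name, region, main class, jar path and memory values are all placeholders):

# Placeholder values; size spark.executor.memory / spark.executor.memoryOverhead to your job.
gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --region=us-central1 \
  --class=com.example.MyJob \
  --jars=gs://my-bucket/my-job.jar \
  --properties='spark.executor.memory=8g,spark.executor.memoryOverhead=2g'

The same settings can be passed to a plain spark-submit with --conf spark.executor.memory=8g --conf spark.executor.memoryOverhead=2g.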