Why does my Spark job run into OOM (OutOfMemory) errors during the write stage, even though my input CSV is only ~3GB and I'm using two executors with 80GB RAM each?
I've investigated and found that the root cause seems to be a series of joins in a loop, like this:
Dataset<Row> ret = somethingDataFrame;
for (String cmd : cmds) {
    ret = ret.join(processDataset(ret, cmd), "primary_key");
}
Each processDataset(ret, cmd) is fast when run independently. But when I perform many of these joins in a loop (10–20 times), the job becomes much slower and eventually causes OOM during the write phase. The Spark UI doesn't show obvious memory pressure during the joins—so why does this pattern cause such heavy memory usage and eventual failure?
Adjust Executor Configuration:
Use more executors by increasing --total-executor-cores in spark-submit and setting spark.executor.cores to a reasonable number (typically 3–5 cores per executor), so the same total core count is spread across more, smaller executors. You currently have 14 cores per executor, which is much higher than recommended.
Allocate more memory per executor with spark.executor.memory.
Allocate more memory for the driver using the --driver-memory option in spark-submit. (A configuration sketch follows this list.)
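Here is a minimal sketch of what that could look like when the session is built; the application name and the 4-core / 16g values are illustrative assumptions, not tuned recommendations:

import org.apache.spark.sql.SparkSession;

// Sketch only: illustrative values, not tuned for this workload.
// spark.executor.cores and spark.executor.memory are ordinary Spark properties, so they can be
// set here or passed as spark-submit flags; --driver-memory and --total-executor-cores have to be
// given on the spark-submit command line, because the driver JVM is already running by the time
// this code executes.
SparkSession spark = SparkSession.builder()
        .appName("join-loop-job")                 // hypothetical application name
        .config("spark.executor.cores", "4")      // 3-5 cores per executor, per the advice above
        .config("spark.executor.memory", "16g")   // more, smaller executors instead of two huge ones
        .getOrCreate();

Equivalently, the same two properties can be passed as --conf spark.executor.cores=4 and --conf spark.executor.memory=16g on the spark-submit command line.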
Optimize Spark Partitions:
Set .config("spark.sql.shuffle.partitions", numPartitionsShuffle) in your Spark session, as shown in the sketch below.
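For example (a sketch, with 400 as a purely illustrative value), you can set it when the session is built or change it at runtime before the join loop, since spark.sql.shuffle.partitions is a runtime SQL configuration:

import org.apache.spark.sql.SparkSession;

int numPartitionsShuffle = 400;  // illustrative; tune to your data volume (the default is 200)

// Option 1: set it when building the session, as described above.
SparkSession spark = SparkSession.builder()
        .config("spark.sql.shuffle.partitions", numPartitionsShuffle)
        .getOrCreate();

// Option 2: change it at runtime, before the join loop runs.
spark.conf().set("spark.sql.shuffle.partitions", numPartitionsShuffle);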
Monitor Memory Usage:
Check the PeakExecutionMemory metric in the Spark UI's Stages → Tasks table to see whether any task has unusually high memory usage.
Check the Executors tab to monitor actual memory consumption for both the driver and the executors.
Analyze Execution Plan:
Use explain() in your code to review the logical and physical plans (a short example follows below).
Pay particular attention to the join operations, as they can significantly increase memory usage, especially if they generate a large number of duplicate rows.
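For instance, a minimal sketch using the ret variable from the loop in the question, placed just before the write:

// explain(true) prints the parsed, analyzed, optimized, and physical plans to stdout.
// With 10-20 chained joins the plan can become very deep, which is worth inspecting
// before the write stage that fails.
ret.explain(true);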