Why does my Spark job run into OOM (OutOfMemory) errors during the write stage, even though my input CSV is only ~3GB and I'm using two executors with 80GB RAM each?
I've investigated and found that the root cause seems to be a series of joins in a loop, like this:
Dataset<Row> ret = somethingDataFrame;
for (String cmd : cmds) {
    ret = ret.join(processDataset(ret, cmd), "primary_key");
}
Each processDataset(ret, cmd) is fast when run independently. But when I perform many of these joins in a loop (10–20 times), the job becomes much slower and eventually causes OOM during the write phase. The Spark UI doesn't show obvious memory pressure during the joins—so why does this pattern cause such heavy memory usage and eventual failure?
Adjust Executor Configuration:
Use more executors by increasing --total-executor-cores in spark-submit and setting spark.executor.cores to a reasonable number (typically 3–5 cores per executor), so the same total core count is spread across more, smaller executors. You currently have 14 cores per executor, which is much higher than recommended.
Allocate more memory per executor with spark.executor.memory.
Allocate more memory for the driver using the --driver-memory option in spark-submit. (A configuration sketch follows this list.)
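Here is a minimal sketch of what that could look like when the session is built; the application name and the 4-core / 16g values are illustrative assumptions, not tuned recommendations:

import org.apache.spark.sql.SparkSession;

// Sketch only: illustrative values, not tuned for this workload.
// spark.executor.cores and spark.executor.memory are ordinary Spark properties, so they can be
// set here or passed as spark-submit flags; --driver-memory and --total-executor-cores have to be
// given on the spark-submit command line, because the driver JVM is already running by the time
// this code executes.
SparkSession spark = SparkSession.builder()
        .appName("join-loop-job")                 // hypothetical application name
        .config("spark.executor.cores", "4")      // 3-5 cores per executor, per the advice above
        .config("spark.executor.memory", "16g")   // more, smaller executors instead of two huge ones
        .getOrCreate();

Equivalently, the same two properties can be passed as --conf spark.executor.cores=4 and --conf spark.executor.memory=16g on the spark-submit command line.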
Optimize Spark Partitions:
Set .config("spark.sql.shuffle.partitions", numPartitionsShuffle) in your Spark session, as shown in the sketch below.
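For example (a sketch, with 400 as a purely illustrative value), you can set it when the session is built or change it at runtime before the join loop, since spark.sql.shuffle.partitions is a runtime SQL configuration:

import org.apache.spark.sql.SparkSession;

int numPartitionsShuffle = 400;  // illustrative; tune to your data volume (the default is 200)

// Option 1: set it when building the session, as described above.
SparkSession spark = SparkSession.builder()
        .config("spark.sql.shuffle.partitions", numPartitionsShuffle)
        .getOrCreate();

// Option 2: change it at runtime, before the join loop runs.
spark.conf().set("spark.sql.shuffle.partitions", numPartitionsShuffle);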
Monitor Memory Usage:
Check the PeakExecutionMemory metric in the Spark UI's Stages → Tasks table to see whether any task has unusually high memory usage.
Check the Executors tab to monitor actual memory consumption for both the driver and the executors.
Analyze Execution Plan:
Use explain() in your code to review the logical and physical plans (a short example follows below).
Pay particular attention to the join operations, as they can significantly increase memory usage, especially if they generate a large number of duplicate rows.
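For instance, a minimal sketch using the ret variable from the loop in the question, placed just before the write:

// explain(true) prints the parsed, analyzed, optimized, and physical plans to stdout.
// With 10-20 chained joins the plan can become very deep, which is worth inspecting
// before the write stage that fails.
ret.explain(true);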