azure, azure-databricks

Driver OOM issue


Why does my Spark job hit out-of-memory (OOM) errors during the write stage, even though my input CSV is only ~3 GB and I'm using two executors with 80 GB of RAM each?

I've investigated and found that the root cause seems to be a series of joins in a loop, like this:

Dataset<Row> ret = somethingDataFrame;
for (String cmd : cmds) {
    ret = ret.join(processDataset(ret, cmd), "primary_key");
}

Each processDataset(ret, cmd) call is fast when run on its own. But when I chain many of these joins in a loop (10–20 iterations), the job slows down dramatically and eventually fails with OOM during the write phase. The Spark UI doesn't show obvious memory pressure during the joins, so why does this pattern cause such heavy memory usage and eventual failure?
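One way to see why this loop gets expensive: in each iteration, ret appears twice on the right-hand side (once as the left input of the join, once inside processDataset), so the unoptimized logical plan roughly doubles in size every pass. A back-of-envelope model of that growth (illustrative only; real plan sizes depend on what processDataset does and on what Catalyst can simplify):

```java
public class PlanGrowth {
    public static void main(String[] args) {
        // Toy model: node count of the logical plan for `ret`,
        // starting from a single node for somethingDataFrame.
        long nodes = 1;
        for (int i = 1; i <= 20; i++) {
            // ret = ret.join(processDataset(ret, cmd), "primary_key"):
            // the new plan embeds the previous plan twice, plus a join node.
            nodes = 2 * nodes + 1;
            System.out.println("after join " + i + ": ~" + nodes + " plan nodes");
        }
    }
}
```

Under this model, 20 iterations yield a plan on the order of two million nodes, which the driver must analyze and optimize before the write can even start. That cost shows up at the action (the write), not during the individual joins, which is consistent with the Spark UI looking quiet earlier.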


Solution
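Assuming the cause is query-plan (lineage) growth rather than data skew or a genuinely large shuffle, the usual mitigation is to periodically truncate the lineage so the plan stays bounded. A sketch, reusing the names from the question (somethingDataFrame, cmds, processDataset) and assuming an existing SparkSession named spark; the checkpoint interval of 5 and the checkpoint directory are arbitrary choices to tune for your job:

```java
// Any durable path the cluster can write to; required before checkpoint().
spark.sparkContext().setCheckpointDir("/tmp/spark-checkpoints");

Dataset<Row> ret = somethingDataFrame;
int i = 0;
for (String cmd : cmds) {
    ret = ret.join(processDataset(ret, cmd), "primary_key");
    if (++i % 5 == 0) {
        // checkpoint() materializes the data and replaces the accumulated
        // plan with a fresh scan of the checkpointed files, so the plan
        // no longer doubles on every iteration.
        ret = ret.checkpoint();
    }
}
```

localCheckpoint() is a cheaper, executor-local variant when fault tolerance of the intermediate result is not needed; note that cache()/persist() alone do not truncate the logical plan, so they do not fix plan-size blowup by themselves.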