apache-spark · databricks · azure-databricks

Spark Executor OOM Error with Adaptive Query Execution enabled


We have a Databricks Spark job with spark.sql.autoBroadcastJoinThreshold set to "-1". After migrating from Databricks Runtime 10.4 to 15.4, one of our Spark jobs, which uses a broadcast hint, started to fail with the following error:

ERROR Executor: Exception in task 2.0 in stage 371.0 (TID 16912)
org.apache.spark.memory.SparkOutOfMemoryError: [EXECUTOR_BROADCAST_JOIN_OOM] There is not enough memory to build the broadcast relation LongToUnsafeRowMap. Relation Size = 1462.4 MiB. Total memory used by this task = 1526.4 MiB. Executor Memory Manager Metrics: onHeapExecutionMemoryUsed = 2.4 GiB, offHeapExecutionMemoryUsed = 0.0 B, onHeapStorageMemoryUsed = 472.5 MiB, offHeapStorageMemoryUsed = 0.0 B. [sparkPlanId: Some(44226)] SQLSTATE: 53200

This job fails regardless of the resources we use; it fails even with Standard_D8s_v3 worker nodes, which have 32 GB of RAM. Also, right before the error we see a log message showing that there is plenty of free memory:

INFO MemoryStore: Block broadcast_188 stored as values in memory (estimated size 359.3 KiB, free 24.0 GiB)

This looks like an Adaptive Query Execution (AQE) issue, as disabling AQE makes the problem go away.
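For reference, the diagnostic that isolates AQE is a single standard Spark setting, shown here as a PySpark snippet (assuming the `spark` session that Databricks notebooks provide). This is only a temporary check, not the fix we want:

```python
# Temporary diagnostic only: turning off Adaptive Query Execution
# (standard Spark config key) makes the job succeed, which points at
# an AQE-related code path rather than genuine memory pressure.
spark.conf.set("spark.sql.adaptive.enabled", "false")
```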

Could anybody advise how to overcome this issue without disabling AQE?


Solution

  • This issue was caused by a new Databricks feature, executor broadcast join (https://kb.databricks.com/python/job-fails-with-not-enough-memory-to-build-the-hash-map-error), so to overcome it we needed to disable executor broadcast.

    Using Databricks notebook autocomplete, we found the class that contains all Databricks-related configuration keys: com.databricks.sql.DatabricksSQLConf. Inspecting its public members, we found the setting that disables executor broadcast join: spark.databricks.execution.executorSideBroadcast.enabled. Disabling executor broadcast resolved our problem: broadcasting works again and AQE stays enabled.

    It is unfortunate that Databricks has so many properties that affect query execution yet leaves them undocumented.
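For completeness, the workaround amounts to one configuration change, shown as a PySpark snippet (assuming the Databricks-provided `spark` session). Note the key is the undocumented one found via autocomplete, so treat it as internal and subject to change between runtimes:

```python
# Disable executor-side broadcast join (undocumented Databricks key
# discovered via com.databricks.sql.DatabricksSQLConf). Spark then
# falls back to the classic driver-side broadcast, which keeps AQE
# enabled while avoiding the EXECUTOR_BROADCAST_JOIN_OOM failure.
spark.conf.set(
    "spark.databricks.execution.executorSideBroadcast.enabled", "false"
)
```

The same key can also be set cluster-wide under the cluster's Spark config (one `key value` pair per line), which avoids having to set it in every notebook.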