Tags: apache-spark, apache-spark-sql

Why does BroadcastExchange need more driver memory?


When broadcasting, Spark can fail with the error org.apache.spark.sql.errors.QueryExecutionErrors#notEnoughMemoryToBuildAndBroadcastTableError (Spark 3.2.1):


Why does BroadcastExchange need more driver memory? If all a broadcast does is send data to the workers, why is driver memory the bottleneck?

Thanks.


Solution

  • Unfortunately, executor-side broadcast joins are not yet supported in Spark (see SPARK-17556). Currently, all data of the broadcast dataset is first collected on the driver to build an in-memory hash table, which is then distributed to the workers. This can put significant memory pressure on the driver.
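Because the driver must hold the entire hash table, the usual workarounds are to give the driver more memory or to stop Spark from picking a broadcast join in the first place. A minimal sketch (the memory value and `app.jar` are placeholders; `spark.driver.memory` and `spark.sql.autoBroadcastJoinThreshold` are standard Spark configuration keys):

```shell
# Option 1: give the driver enough memory to build the broadcast hash table
spark-submit --driver-memory 8g app.jar

# Option 2: disable automatic broadcast joins (threshold of -1) so Spark
# falls back to a shuffle-based join that never collects data on the driver
spark-submit --conf spark.sql.autoBroadcastJoinThreshold=-1 app.jar
```

Disabling the threshold trades driver memory pressure for a shuffle, which is slower but avoids this error entirely; raising driver memory keeps the broadcast join's performance benefit as long as the table actually fits.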