apache-sparkscala-spark

Spark broadcasts the right dataset of a left join, which causes org.apache.spark.sql.execution.OutOfMemorySparkException


Spark broadcasts the right dataset of a left join, which causes org.apache.spark.sql.execution.OutOfMemorySparkException: Size of broadcasted table far exceeds estimates and exceeds limit of spark.driver.maxResultSize. This happens even though I set the options that should disable broadcasting: spark.sql.adaptive.autoBroadcastJoinThreshold=-1, spark.sql.adaptive.enabled=false, spark.sql.autoBroadcastJoinThreshold=-1. I found that under some conditions Spark can still choose a broadcast nested loop join as the only available join strategy, especially for a non-equi join: https://stackoverflow.com/a/76390114/20321407. I added a "MERGEJOIN" hint and enabled debug logging. In the log I found the following:

DEBUG ExtractEquiJoinKeys: Considering join on: Some((((_1#119114._1.card_number = _2#119115.card) OR (_1#119114._1.card_number = _2#119115.visible_number_on_card)) OR (_1#119114._2.string = _2#119115.visible_number_on_card)))
WARN HintErrorLogger: Hint (strategy=merge) is not supported in the query: no equi-join keys.

The question is: is there any way to avoid a broadcast join when the join condition uses OR? Or is the only option to run several joins one after another? Can a join with an OR condition be treated as an equi-join and not be broadcast?


Solution

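  • It seems the only way to avoid the broadcast is to split the OR condition into several consecutive joins, one per equality. As the log above shows, Spark finds no equi-join keys in a condition whose top-level operator is OR, so the merge hint cannot be applied to it.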
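
A minimal sketch of that split, written as a script for spark-shell or a similar Scala REPL. The sample rows and the unique id columns lid and rid are assumptions made for illustration; they are not in the original datasets, and the real data would need some stable unique key on each side. Each branch of the OR becomes its own inner equi-join, the matched id pairs are unioned and deduplicated, and the full rows are re-attached with two more equi-joins so that unmatched left rows survive as in a left join.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; broadcast joins disabled as in the question.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("split-or-join")
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data; lid and rid are assumed unique row ids.
val left = Seq((1L, "111", "x"), (2L, "222", "999"), (3L, "333", "y"))
  .toDF("lid", "card_number", "string")
val right = Seq((10L, "111", "aaa"), (20L, "bbb", "222"), (30L, "ccc", "999"))
  .toDF("rid", "card", "visible_number_on_card")

// One inner equi-join per branch of the OR, keeping only the matched id pairs.
val pairs = Seq(
  left("card_number") === right("card"),
  left("card_number") === right("visible_number_on_card"),
  left("string") === right("visible_number_on_card")
).map(cond => left.join(right, cond).select("lid", "rid"))
  .reduce((a, b) => a.union(b))
  .distinct() // the same pair may satisfy more than one branch

// Re-attach the full rows; unmatched left rows keep null right-hand columns.
val result = left
  .join(pairs, Seq("lid"), "left")
  .join(right, Seq("rid"), "left")
```

Every join here has an equality key, so Spark is free to plan a sort-merge join for each of them; result.explain() can be used to check which strategy was actually chosen. The distinct() step matters: without it, a pair of rows matching two branches of the OR would appear twice, which the original single join would not produce.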