apache-sparkapache-spark-sqldatasetapache-spark-datasetskew

Spark AQE not helping with dataset skew join


I'm facing a problem in spark where 2 skewed datasets takes too long to join. One(or two) of the datasets has skewed data in it and it's used as the join column.

So I enabled spark AQE in the hope of it might help me with skewed dataset join. However when I checked the sql query metrics they don't suggest AQE is helping me with the skew and some of the partitions are still quite large. And when I check the stage status I found a few long running tasks taking hours to complete.

SQL query run metrics screenshot

I'm quite confused by the behavior of AQE and very surprised to find out that it didn't seem to be helping. Could anyone point out what's wrong here or if I'm missing anything?

btw here are some of my spark configurations:

.config("spark.sql.adaptive.enabled", "true") \
.config("spark.sql.adaptive.skewJoin.enabled", "true") \
.config("spark.executor.memory", "32g") \
.config("spark.executor.memoryOverhead", "8g") \
.config("spark.sql.shuffle.partitions", "2000") \

Solution

  • In versions 3.0 to 3.2, AQE skew join optimization is still super rudimentary. If you manually alter the number of partitions then it will be skipped. Likewise, much of AQE will be skipped if you use caching. In 3.3 you can force skew join optimization when you are manually partitioning using config spark.sql.adaptive.forceOptimizeSkewedJoin.