Something what we might miss as engineers when we talk about AQE - Adaptive Query Execution
on Spark/Databricks:
If you're using coalesce()
to reduce partitions, AQE won’t touch it. No skew detection. No repartitioning. No optimization. Because coalesce()
do not perform full shuffle (like repartition()
) - it merges existing partitions without redistributing.
This is how data skew can appear quietly after coalesce()
and break or slowdown your jobs. Documentation I found a little bit unclear. AQE will intervene after you do a repartition()
which triggers a full shuffle. Is this correct understanding? Documentation seems unclear on this scenario.
Yes, you are correct on your assertion / question.
Coalesce(n)
does not result in a wide / full shuffle as it were - like repartition(n)
as you state. It is a merge of sorts.
There is therefore no signal to Catalyst for run-time optimizing to see if AQE
could be applied - as there is no full shuffle to be detected, which serves as an optimization, precursor condition for invoking AQE
. JOIN
s and groupBy()
are other precursors.
NB: There is some movement and therefore redistribution of data with a coalesce(n)
.