apache-sparkpysparkdatabricks

Adaptive Query Execution Spark on Databricks with Coalesce


Something what we might miss as engineers when we talk about AQE - Adaptive Query Execution on Spark/Databricks:

If you're using coalesce() to reduce partitions, AQE won’t touch it. No skew detection. No repartitioning. No optimization. Because coalesce() do not perform full shuffle (like repartition()) - it merges existing partitions without redistributing. This is how data skew can appear quietly after coalesce() and break or slowdown your jobs. Documentation I found a little bit unclear. AQE will intervene after you do a repartition() which triggers a full shuffle. Is this correct understanding? Documentation seems unclear on this scenario.

enter image description here


Solution

  • Yes, you are correct on your assertion / question.

    Coalesce(n) does not result in a wide / full shuffle as it were - like repartition(n) as you state. It is a merge of sorts.

    There is therefore no signal to Catalyst for run-time optimizing to see if AQE could be applied - as there is no full shuffle to be detected, which serves as an optimization, precursor condition for invoking AQE. JOINs and groupBy() are other precursors.

    NB: There is some movement and therefore redistribution of data with a coalesce(n).