apache-spark, pyspark

Shuffle - within same worker node


In Spark, there can be multiple worker nodes, and each worker node can have multiple executors.

Shuffle is the movement of data across partitions. This data movement can happen in one of two ways:

(a) data is moved between executors on different worker nodes, or
(b) data is moved between executors (or partitions) on the same worker node.

Do we call both (a) and (b) a shuffle? Given that (b) happens within the same worker node, would the overhead of that data movement be lower?


Solution

  • Following Spark's documentation: "The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation."

    So the answer is: both count as a shuffle. As for overhead, when the data movement stays within the same worker node the records still have to be serialized and written to shuffle files, but no network transfer is needed, so the cost is typically lower than for a cross-node shuffle.
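
To see why re-grouping data forces movement across partitions, here is a minimal pure-Python sketch (not actual Spark code) of the key-hashing idea behind Spark's default hash partitioner: each record goes to partition `hash(key) % num_partitions`, so records sharing a key that start out in different input partitions must be moved into the same output partition. The function name `hash_partition` and the sample data are illustrative, not part of any Spark API.

```python
def hash_partition(records, num_partitions):
    """Assign each (key, value) record to a partition by key hash,
    mimicking the idea behind Spark's default hash partitioning."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

# Two input partitions, as they might exist before a shuffle:
input_partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
]

# After re-partitioning, every record with key "a" lands in the same
# output partition, regardless of which input partition it came from --
# which is exactly the data movement the shuffle performs.
shuffled = hash_partition(
    [rec for part in input_partitions for rec in part],
    num_partitions=2,
)
```

Whether that movement crosses the network depends only on where the source and destination partitions physically live; the partitioning logic itself is the same in both cases (a) and (b).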