apache-spark

Definition of shuffling in Spark


I know that joins with co-partitioned DataFrames are not considered wide transformations. Here is how the original RDD paper defines narrow and wide dependencies:

narrow dependencies, where each partition of the parent RDD is used by at most one partition of the child RDD; and wide dependencies, where multiple child partitions may depend on it.

Even if the DataFrames are co-partitioned, it doesn’t necessarily mean their corresponding partitions are located on the same node. For example, partition P1 of df1 and P1 of df2 may reside on different nodes. So, during the join, data transfer (e.g., moving P1 of df1 to the node of P1 of df2) is still necessary. However, this is not considered a shuffle.
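To make "co-partitioned" concrete, here is a plain-Python sketch (not the Spark API) of the idea: if both datasets are hash-partitioned by the join key into the same number of partitions, every key lands at the same partition index in both, so the join can pair up partitions one-to-one. The function and variable names are illustrative, not Spark's.

```python
# Plain-Python sketch of co-partitioning (illustrative names, not Spark API).
NUM_PARTITIONS = 4

def partition_by_key(rows, num_partitions):
    """Assign each (key, value) row to partition hash(key) % num_partitions."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

df1 = [(k, f"a{k}") for k in range(10)]
df2 = [(k, f"b{k}") for k in range(10)]

# Same partitioner, same partition count => co-partitioned.
parts1 = partition_by_key(df1, NUM_PARTITIONS)
parts2 = partition_by_key(df2, NUM_PARTITIONS)

# Join partition i of df1 only with partition i of df2. If partition i of
# df2 lives on another node, at most that one shard has to move over the
# network -- there is no all-to-all exchange.
joined = []
for p1, p2 in zip(parts1, parts2):
    lookup = dict(p2)
    for key, value in p1:
        if key in lookup:
            joined.append((key, value, lookup[key]))

print(len(joined))
```

Every key appears exactly once in both inputs here, so the join produces one output row per key; the point is that each output row was computed from a single pair of aligned partitions.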

I have two questions:

  1. So, what exactly is a shuffle? I understand that not all network data transfers are considered shuffles.
  2. Which types of data transfers are considered shuffles? Are they only the ones involved in wide transformations?

Solution

  • In a shuffle, each executor writes out shards for all the other executors. Imagine 100 people, each sending a letter to everyone else.

    After writing out the shards, every executor has to collect its incoming shards from all the other executors, like 100 people each receiving 99 letters.

    In contrast, when you have co-partitioned RDDs that simply don't live on the same nodes, each executor only has to fetch one shard, like 100 people receiving 1 letter each.
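    The letters analogy can be put into numbers. A small sketch (plain arithmetic, with an assumed cluster of 100 executors and one shard per sender-receiver pair) comparing how many shards cross the network in a full shuffle versus a co-partitioned join:

    ```python
    # Count shard transfers for the "letters" analogy.
    # Assumption: 100 executors, one shard per sender-receiver pair.
    N = 100

    # Full shuffle: every executor writes one shard for each of the other
    # executors, then fetches one incoming shard from each of them.
    shards_sent_per_executor = N - 1
    shuffle_total = N * (N - 1)   # all-to-all: 100 * 99 transfers

    # Co-partitioned join whose partitions live on different nodes:
    # each executor fetches at most one matching shard.
    copartitioned_total = N

    print(shuffle_total, copartitioned_total)  # 9900 100
    ```

    The all-to-all pattern grows quadratically with the number of executors, while the co-partitioned fetch grows only linearly; that gap is why Spark tracks partitioning and avoids the shuffle when it can.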