I know that joins with co-partitioned DataFrames are not considered wide transformations. Here are the definitions of wide and narrow transformations from the original paper.
narrow dependencies, where each partition of the parent RDD is used by at most one partition of the child RDD, wide dependencies, where multiple child partitions may depend on it.
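To make the definition concrete, here is a minimal pure-Python sketch (not Spark code) that classifies a dependency from a map of child partitions to the parent partitions they read. The function name `classify` and the example maps are hypothetical and only illustrate the wording quoted above.

```python
from collections import Counter

def classify(dependencies):
    """Classify a dependency as 'narrow' or 'wide'.

    `dependencies` maps each child partition to the parent partitions
    it reads. Hypothetical helper for illustration, not a Spark API.
    """
    # Count how many child partitions read each parent partition.
    uses = Counter(
        parent
        for parents in dependencies.values()
        for parent in set(parents)
    )
    # Narrow: every parent partition is used by at most one child partition.
    return "narrow" if all(n <= 1 for n in uses.values()) else "wide"

# map-like: child i reads only parent i
map_deps = {0: [("rdd", 0)], 1: [("rdd", 1)], 2: [("rdd", 2)]}

# groupByKey-like: every child reads every parent partition
shuffle_deps = {c: [("rdd", p) for p in range(3)] for c in range(3)}

# co-partitioned join: child i reads partition i of both parents
join_deps = {i: [("df1", i), ("df2", i)] for i in range(3)}

print(classify(map_deps), classify(shuffle_deps), classify(join_deps))
```

Under this definition the co-partitioned join counts as narrow, because no parent partition feeds more than one child partition.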
Even if the DataFrames are co-partitioned, it doesn’t necessarily mean their corresponding partitions are located on the same node. For example, partition P1 of df1 and P1 of df2 may reside on different nodes. So, during the join, data transfer (e.g., moving P1 of df1 to the node of P1 of df2) is still necessary. However, this is not considered a shuffle.
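As a rough illustration of what co-partitioning buys you, the sketch below partitions two small datasets with the same function and then joins them partition by partition. It is plain Python, not Spark; `partition_by_key`, `join_co_partitioned`, the modulo partitioner, and the sample rows are all assumptions made for the example.

```python
def partition_by_key(rows, num_partitions):
    """Split (key, value) rows into partitions using key % num_partitions."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[key % num_partitions].append((key, value))
    return parts

def join_co_partitioned(left_parts, right_parts):
    """Inner join where partition i of the left only meets partition i of the right."""
    joined = []
    for left, right in zip(left_parts, right_parts):
        # Build a lookup for this one right-hand partition only.
        lookup = {}
        for key, value in right:
            lookup.setdefault(key, []).append(value)
        for key, value in left:
            for other in lookup.get(key, []):
                joined.append((key, value, other))
    return joined

df1 = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
df2 = [(1, "x"), (3, "y"), (4, "z"), (5, "w")]

# Same partitioner and same partition count on both sides.
result = join_co_partitioned(partition_by_key(df1, 3), partition_by_key(df2, 3))
print(sorted(result))
```

Because both sides use the same partitioner, matching keys always land at the same partition index, so no partition ever needs rows from more than one partition of the other side. Where those two partitions physically live is a separate matter.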
I have two questions.
In a shuffle, each executor writes out shards for all the other executors. Imagine 100 people who each send a letter to everyone else.
After writing out the shards, every executor has to collect its incoming shards from all the other executors, like 100 people each receiving letters from everyone else.
In contrast, when the RDDs are co-partitioned but not located on the same nodes, each executor only has to fetch one shard, like 100 people receiving one letter each.
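A back-of-the-envelope count of the letters, as a small Python sketch. It assumes one shard per executor pair in the shuffle case and, for the co-partitioned case, the worst case where no pair of matching partitions happens to be co-located. The function names are hypothetical.

```python
def shuffle_transfers(num_executors):
    """Every executor sends one shard to every other executor."""
    return num_executors * (num_executors - 1)

def co_partitioned_transfers(num_executors):
    """Worst case: every executor fetches exactly one remote partition."""
    return num_executors

n = 100
print(shuffle_transfers(n))         # all-to-all exchange
print(co_partitioned_transfers(n))  # one-to-one fetches
```

The transfer count grows quadratically with the number of executors in the shuffle case and only linearly in the co-partitioned case.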