apache-spark sparkcore

groupByKey on a Spark Dataset


Does groupByKey cause a shuffle of all values across the network, even if they are already co-located within a partition? And when we do a group-by operation in Spark SQL, does it use groupByKey, or does it use aggregateByKey for performance?


Solution

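  • groupByKey will not shuffle the data if all values for each key are already co-located in a single partition and Spark knows it, that is, the input already carries a partitioner matching the one groupByKey would use (for example, an RDD previously partitioned with partitionBy). Values that merely happen to sit together, without a known partitioner, are still shuffled. Either way, the no-shuffle situation is a rare case.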

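    The groupBy operation in Spark SQL behaves like aggregateByKey rather than groupByKey, which makes it an aggregation operation. The aggregation functions are defined after groupBy. Spark creates one instance of the aggregation expressions for each group and each aggregation, then goes through the data and keeps updating those expressions. For the built-in aggregate functions, rows are typically partially aggregated within each partition first, so only the partial results are shuffled rather than every row.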
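The difference in shuffle volume between the two styles can be sketched without Spark. The following is a plain-Python model, not Spark code: the function names group_by_key_model and aggregate_by_key_model and the sample partitions are made up for illustration, and the "shuffle" is just a list whose length stands in for the number of records sent over the network.

```python
from collections import defaultdict


def group_by_key_model(partitions):
    """Model of groupByKey: every (key, value) record goes through the shuffle."""
    shuffled = [record for part in partitions for record in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    # Aggregation happens only after all values have been moved and grouped.
    sums = {key: sum(values) for key, values in groups.items()}
    return sums, len(shuffled)


def aggregate_by_key_model(partitions, zero, seq_op, comb_op):
    """Model of aggregateByKey: fold values per key inside each partition first."""
    shuffled = []
    for part in partitions:
        buffers = {}
        for key, value in part:
            # seq_op updates the per-key buffer with one more value.
            buffers[key] = seq_op(buffers.get(key, zero), value)
        # At most one partial result per key leaves each partition.
        shuffled.extend(buffers.items())
    merged = {}
    for key, partial in shuffled:
        # comb_op merges partial results coming from different partitions.
        merged[key] = comb_op(merged[key], partial) if key in merged else partial
    return merged, len(shuffled)


def add(x, y):
    return x + y


# Hypothetical input: two partitions of (key, value) records.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5), ("b", 6), ("c", 7)],
]

grouped_sums, grouped_shuffled = group_by_key_model(partitions)
aggregated_sums, aggregated_shuffled = aggregate_by_key_model(partitions, 0, add, add)

print(grouped_sums, grouped_shuffled)
print(aggregated_sums, aggregated_shuffled)
```

Both functions produce the same per-key sums, but the aggregate-style model ships one partial result per key per partition instead of every record. The gap widens as the number of values per key grows.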