Does groupByKey cause a shuffle of all values across the network, even if they are already co-located within a partition? And when we do a groupBy operation in Spark SQL, does it use groupByKey, or does it use aggregateByKey for performance?
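groupByKey will not shuffle the data if the RDD is already partitioned by key with the same partitioner that groupByKey would use (for example, after a partitionBy), because all values for a key then already sit in one partition. But that is a rare case. If the keys merely happen to be co-located and Spark has no partitioner recorded for the RDD, it cannot know that and shuffles anyway.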
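To make the partitioner condition concrete, here is a minimal plain-Python model of it. This is not Spark code: hash_partition, group_by_key, and the ("hash", n) tuple standing in for a partitioner are all hypothetical names made up for this sketch. Integer keys are used so the partition assignment is deterministic.

```python
from collections import defaultdict


def hash_partition(records, num_partitions):
    """Place each (key, value) record in the partition chosen by hash(key)."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def group_by_key(partitions, known_partitioner=None):
    """Toy groupByKey.

    Returns (grouped partitions, number of records sent through the shuffle).
    known_partitioner models the partitioner recorded on the dataset, if any.
    """
    n = len(partitions)
    target = ("hash", n)  # hypothetical stand-in for a hash partitioner with n partitions
    if known_partitioner == target:
        # The layout is known to match: group inside each partition, no shuffle.
        shuffled = 0
        routed = partitions
    else:
        # The layout is unknown: every record is routed by key again.
        shuffled = sum(len(p) for p in partitions)
        routed = hash_partition([r for p in partitions for r in p], n)

    grouped = []
    for part in routed:
        groups = defaultdict(list)
        for key, value in part:
            groups[key].append(value)
        grouped.append(dict(groups))
    return grouped, shuffled
```

In this model the grouped result is identical either way; what changes is only whether every record has to be moved first, which is the cost the question is asking about.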
The groupBy operation in Spark SQL works like an aggregateByKey rather than a groupByKey, which makes it an aggregation operation. We can define aggregation functions after groupBy in Spark SQL. groupBy simply creates one instance of the aggregation expressions for each group and each aggregation, then goes through the data and keeps updating those expressions, so the values of a group are folded into a small running state instead of being collected in full.
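The per-group update described here can be sketched in plain Python. This is only an illustration of the idea, not Spark's actual implementation: the Sum and Avg classes and the partial_aggregate and group_by_agg functions are hypothetical names invented for the example. It also assumes an aggregateByKey-style flow, where each partition is aggregated on its own first and the partial states are merged afterwards.

```python
class Sum:
    """Running-sum buffer (hypothetical stand-in for an aggregation expression)."""

    def __init__(self):
        self.value = 0

    def update(self, x):
        self.value += x

    def merge(self, other):
        self.value += other.value

    def result(self):
        return self.value


class Avg:
    """Running-average buffer that keeps a total and a count."""

    def __init__(self):
        self.total = 0
        self.count = 0

    def update(self, x):
        self.total += x
        self.count += 1

    def merge(self, other):
        self.total += other.total
        self.count += other.count

    def result(self):
        return self.total / self.count


def partial_aggregate(partition, agg_classes):
    """One pass over a partition: one buffer per group and per aggregation."""
    buffers = {}
    for key, value in partition:
        if key not in buffers:
            buffers[key] = [cls() for cls in agg_classes]
        for buf in buffers[key]:
            buf.update(value)
    return buffers


def group_by_agg(partitions, agg_classes):
    """Merge the per-partition buffers, then read out the final results."""
    merged = {}
    for partition in partitions:
        for key, bufs in partial_aggregate(partition, agg_classes).items():
            if key not in merged:
                merged[key] = bufs
            else:
                for mine, theirs in zip(merged[key], bufs):
                    mine.merge(theirs)
    return {key: tuple(b.result() for b in bufs) for key, bufs in merged.items()}
```

Only one small buffer per key and aggregation leaves each partition in this sketch, which is why this style is cheaper than shipping every value as groupByKey does.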