scala, apache-spark

Difference between reduceByKey and combineByKey in Spark


Is there any difference between reduceByKey and combineByKey when it comes to performance in Spark? Any help on this is appreciated.


Solution

  • reduceByKey internally calls combineByKey, so the basic way the task is executed is the same for both.

    The reason to choose combineByKey over reduceByKey is when the input type and the output type are not expected to be the same: reduceByKey requires a function that takes and returns the same type, whereas combineByKey lets you build up a combiner of a different type. That type conversion is the only extra overhead combineByKey carries.

    If no type conversion is involved, there is no difference at all.
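To make the distinction concrete, here is a minimal sketch (assuming a local SparkSession and a hypothetical example dataset) that uses reduceByKey for a same-type aggregation (Int in, Int out) and combineByKey for one whose output type differs from the input type, computing a per-key average via a (sum, count) pair:

```scala
import org.apache.spark.sql.SparkSession

object ReduceVsCombine {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("reduceByKey-vs-combineByKey")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical example data: (key, value) pairs
    val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 5)))

    // reduceByKey: the function takes two Ints and returns an Int,
    // so input and output types are identical.
    val sums = pairs.reduceByKey(_ + _)

    // combineByKey: the combiner type (Int, Int) differs from the
    // value type Int, letting us track (sum, count) per key.
    val avgs = pairs
      .combineByKey(
        (v: Int) => (v, 1),                                      // createCombiner: Int -> (sum, count)
        (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),   // mergeValue: fold a value into a combiner
        (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners: merge across partitions
      )
      .mapValues { case (sum, count) => sum.toDouble / count }

    sums.collect().foreach(println)
    avgs.collect().foreach(println)

    spark.stop()
  }
}
```

Both operators combine values map-side before the shuffle; the average could not be expressed with reduceByKey directly, because its reduce function cannot change the value type.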