
Does the merge happen first or does the combiner run first in MapReduce?


Consider a WordCount problem for a MapReduce program.

Let us say the mapper output is as follows: (Hello, 1), (World, 1), (Hello, 1), (Hadoop, 1), (Hello, 1), (Hadoop, 1)

It goes to the partitioner (we specify 2 as the number of reducers), so the map output is partitioned into two parts.

Part 1:
Hello 1
Hello 1
Hello 1

Part 2:
World 1
Hadoop 1
Hadoop 1

At the reducers, we get the input:

Hello [1,1,1]

World [1]

Hadoop [1,1]
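For concreteness, a mapper along these lines would produce the (word, 1) pairs above (a minimal sketch; the class name is illustrative):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (word, 1) for every token of every input line, e.g. the line
    // "Hello World Hello" produces (Hello,1) (World,1) (Hello,1).
    public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }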

Please clarify my understanding of when this merging of values happens. For a MapReduce job: K1,V1 -> (mapper output) K2,V2 -> (sort and shuffle) K2,[V2] -> (reducer output) K3,V3

My query is when this merging of values happens: before the combiner executes, after the combiner executes (during sort and shuffle), or only at the reducer, just before the values are handed to the reduce function.

As per my understanding: the mapper output first goes to an in-memory buffer; when it crosses the threshold of mapreduce.task.io.sort.mb it is spilled to local disk, but before spilling the data is sorted by partition, and within each partition it is sorted by key. After the sort, the combiner is called to reduce the size of the data. After the mapper completes, the spill files are merged, and the combiner may be called again depending on the min.num.spills.for.combine value.
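For reference, the two settings I am referring to can be set on the job configuration roughly like this (the values are illustrative; I believe mapreduce.map.combine.minspills is the Hadoop 2.x name for the older min.num.spills.for.combine key):

    import org.apache.hadoop.conf.Configuration;

    public class SpillTuning {
        public static Configuration tunedConf() {
            Configuration conf = new Configuration();
            // Size in MB of the in-memory sort buffer; a spill to local disk
            // is triggered once the buffer fills past its threshold.
            conf.setInt("mapreduce.task.io.sort.mb", 256);
            // Run the combiner during the merge of spill files only if at
            // least this many spill files exist (assumed Hadoop 2.x key name).
            conf.setInt("mapreduce.map.combine.minspills", 3);
            return conf;
        }
    }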

In the word count problem, the reducer accumulates all the values of the iterable for each key and writes out the key together with the sum of its values.

Since the combiner is a mini reducer, and we specify the same reducer class for the combiner (Job.setCombinerClass(Reduce.class);), is calling the combiner before the merge during sort and shuffle worthwhile, or is my understanding not correct? Please clarify.
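For context, the job setup I am describing looks roughly like this (a minimal sketch; TokenizerMapper is the mapper sketched above and Reduce is my reducer class):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(Reduce.class); // same class reused as the combiner
            job.setReducerClass(Reduce.class);
            job.setNumReduceTasks(2);           // the two partitions described above
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }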


Solution

  • The mapper starts storing its output in a buffer, and when the buffer is full, the combiner is executed before the data is spilled to disk, trying to reduce the amount of data.

    The combiner could be executed 0 times (if the amount of mapper output data is less than the buffer size) or 1 to N times, depending on the amount of data.

    Your process should not depend on the combiner; the combiner is just an optional optimization that reduces the amount of data to be transferred over the network from the mappers to the reducers.

    The result of a previous combiner call could be combined again with later data, so you need to guarantee that the input and the output of the combiner are compatible, and that the output of the combiner is compatible with the input of the reducer (see the sketch after this list).

    The combiner is like a local reducer that combines the data of only one mapper before the data is shuffled and transferred to the reducers.
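As a sketch (mirroring the Reduce class from the question; the body is illustrative), a sum reducer whose input and output key/value types match can safely be reused as the combiner, because summing partial sums gives the same total no matter how many times it runs:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Input types (Text, IntWritable) equal output types (Text, IntWritable),
    // so this class can run 0..N times as the combiner and once more as the
    // reducer without changing the final counts.
    public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }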