hadoop mapreduce combiners

When exactly is the combiner called in MapReduce?


A combiner is usually written using the same class as the reducer, with mostly the same code. My question is when exactly it is called: before sort and shuffle, or just before reduce? If it runs before sort and shuffle, i.e., just after the mapper, how does it get its input as [key, list<values>], given that this grouping is produced by sort and shuffle? And if it runs after sort and shuffle, i.e., just before the reducer, then the combiner's output is [key, value], like a reducer's, so how does the reducer get its input as [key, list<values>]?


Solution

  • The output types of a combiner must match the output types of the mapper. Hadoop makes no guarantee about how many times the combiner is applied, or that it is applied at all.

    If your mapper extends Mapper< K1, V1, K2, V2 > and your reducer extends
    Reducer< K2, V2, K3, V3 >, then the combiner must be an extension of
    Reducer< K2, V2, K2, V2 >.

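    The combiner is applied on the same machine as the map operation, so definitely before the shuffle. Its [key, value] output is then shuffled and regrouped by key like ordinary map output, which is how the reducer still receives [key, list<values>].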

    As the Hadoop documentation puts it:

    When the map operation outputs its pairs they are already available in memory. For efficiency reasons, sometimes it makes sense to take advantage of this fact by supplying a combiner class to perform a reduce-type function. If a combiner is used then the map key-value pairs are not immediately written to the output. Instead they will be collected in lists, one list per each key value. When a certain number of key-value pairs have been written, this buffer is flushed by passing all the values of each key to the combiner's reduce method and outputting the key-value pairs of the combine operation as if they were created by the original map operation.

    http://wiki.apache.org/hadoop/HadoopMapReduce
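
To make the flow above concrete, here is a minimal plain-Java sketch of a word count. It does not use the Hadoop API at all: `CombinerSketch`, `map`, `group`, `sum`, and `run` are hypothetical names made up for illustration, and the local grouping step only stands in for what the map-side buffer does. It assumes a sum-style combiner, where combiner and reducer can share the same code. The point it tries to show is that the combiner takes [key, list<values>] and emits [key, value] pairs of the same types as the map output, so applying it zero, one, or several times leaves the reducer's result unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation, no Hadoop dependency. All names are hypothetical.
final class CombinerSketch {

    // "Map" step: emit one (word, 1) pair per word, i.e. <K2, V2> = <String, Integer>.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Group pairs by key. This is what turns [key, value] pairs into
    // [key, list<values>], both for the combiner (locally) and for the reducer.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Sum function used as both combiner and reducer:
    // <K2, list<V2>> in, <K2, V2> out, so its output can be fed back in.
    static List<Map.Entry<String, Integer>> sum(Map<String, List<Integer>> grouped) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int total = 0;
            for (int v : e.getValue()) {
                total += v;
            }
            out.add(Map.entry(e.getKey(), total));
        }
        return out;
    }

    // Simulate one map task, apply the combiner `combinerPasses` times
    // (possibly zero), then "shuffle" (group by key) and reduce.
    static Map<String, Integer> run(List<String> lines, int combinerPasses) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String line : lines) {
            mapOutput.addAll(map(line));
        }
        for (int i = 0; i < combinerPasses; i++) {
            // Local grouping of buffered pairs, then combine; the output types
            // match the map output types, so it looks like ordinary map output.
            mapOutput = sum(group(mapOutput));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : sum(group(mapOutput))) {
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("a b a", "b a c");
        System.out.println(run(lines, 0)); // combiner never applied
        System.out.println(run(lines, 1)); // combiner applied once
        System.out.println(run(lines, 3)); // combiner applied several times
    }
}
```

All three calls print the same counts, which is why a job has to stay correct whether or not the framework decides to run the combiner. A combiner whose output types differed from the map output types could not be re-applied or skipped like this.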