
Routing Key-Values to the correct Reducer in a Hadoop Secondary Sort


I have three reducers, and I need each of them to receive exactly one of the keys, like so:

GOOG - Reducer 0
AAPL - Reducer 1
VMW - Reducer 2

In the partitioner, the getPartition() method should return an int giving the index of the target reducer, one of (0, 1, 2).

My implementation of getPartition() is:

return ((CompositeKey) key).getSymbol().hashCode() % numReduceTasks;

However, this does not work. Here is what I get:

 int numReduceTasks = 3;
 System.out.println("GOOG".hashCode() % numReduceTasks); // output: 0
 System.out.println("AAPL".hashCode() % numReduceTasks); // output: 1
 System.out.println("VMW".hashCode() % numReduceTasks);  // output: 1

So in the output files I get:

.../part-r-00000

GOOG

.../part-r-00001

AAPL
VMW

.../part-r-00002

<empty>

The question is: how do I fix this? That is, how do I write a partitioner function that guarantees each of these keys goes to its own reducer?


Solution

  • The code is working exactly as you should expect. A hash code is deterministic, but its value is essentially arbitrary, so nothing guarantees that distinct keys give distinct values once you take % 3. The same key always lands on the same reducer; what you are missing is a guarantee that different keys land on different reducers. The only way I see to get that is a series of if statements that makes the decision explicitly:

    if GOOG: return 0
    if AAPL: return 1
    if VMW: return 2
    

    Some advice: going "outside of the box" in MapReduce is a dangerous game. The best way to use MapReduce is to play by the rules, so that you inherit its benefits. It's not always possible, but you should always try!
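
The if-statement idea above could be sketched roughly as follows, in plain Java so it can be tried outside a cluster. This is only a sketch under assumptions: the class name SymbolRouting, the method partitionFor, and the fallback for symbols not in the table are hypothetical and not part of Hadoop or of the question's code. Inside a real partitioner, getPartition() would presumably just delegate to it with ((CompositeKey) key).getSymbol() and numReduceTasks.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; the class and method names are illustrative, not part of Hadoop.
final class SymbolRouting {
    // Explicit symbol-to-reducer table, taken from the layout the question asks for.
    private static final Map<String, Integer> REDUCER_BY_SYMBOL = new HashMap<>();
    static {
        REDUCER_BY_SYMBOL.put("GOOG", 0);
        REDUCER_BY_SYMBOL.put("AAPL", 1);
        REDUCER_BY_SYMBOL.put("VMW", 2);
    }

    // Returns the reducer index for a symbol, always in [0, numReduceTasks).
    static int partitionFor(String symbol, int numReduceTasks) {
        Integer fixed = REDUCER_BY_SYMBOL.get(symbol);
        if (fixed != null && fixed < numReduceTasks) {
            return fixed;
        }
        // Assumed fallback for symbols not in the table (or when fewer reducers
        // are configured): clear the sign bit so the remainder is never negative.
        return (symbol.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String symbol : new String[] {"GOOG", "AAPL", "VMW"}) {
            System.out.println(symbol + " -> " + partitionFor(symbol, 3));
        }
    }
}
```

The trade-off is the one the advice above warns about: a hard-coded table only works while the set of symbols is known in advance and small, and any new symbol falls back to hash-based placement.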