Tags: java, hadoop, mapreduce, partitioner

Hadoop Custom Partitioner not behaving according to the logic


Based on the example here, this approach should work. I have tried the same on my dataset.

Sample Dataset:

OBSERVATION;2474472;137176;
OBSERVATION;2474473;137176;
OBSERVATION;2474474;137176;
OBSERVATION;2474475;137177;

Treating each line as a string, my Mapper output is:

key -> string[2] (the third ';'-separated field, i.e. the id), value -> the whole line.
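
For reference, a minimal sketch of the kind of Mapper that produces this output (class name and parsing details are illustrative, assuming ';' as the field delimiter):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IdMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // key = the third ';'-separated field (the id), value = the whole line
        String[] fields = line.toString().split(";");
        context.write(new Text(fields[2]), line);
    }
}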

My Partitioner code:

@Override
public int getPartition(Text key, Text value, int reducersDefined) {

    String keyStr = key.toString();
    if(keyStr == "137176") {
        return 0;
    } else {
        return 1 % reducersDefined;
    }
}

In my dataset most ids are 137176. Reducers declared: 2. I expect two output files: one for 137176 and a second for the remaining ids. I am getting two output files, but the ids are evenly distributed across both. What's going wrong in my program?


Solution

    1. Explicitly set in the Driver that you want to use your custom Partitioner, by calling job.setPartitionerClass(YourPartitioner.class);. If you don't do that, the default HashPartitioner is used, which assigns partitions by the key's hash code rather than by your logic. For example, see the driver sketch below.
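
      A minimal driver sketch (class names are illustrative, assuming the new org.apache.hadoop.mapreduce API):

      Job job = Job.getInstance(new Configuration(), "custom-partitioner");
      job.setJarByClass(MyDriver.class);
      job.setMapperClass(MyMapper.class);
      job.setPartitionerClass(MyPartitioner.class); // without this, HashPartitioner is used
      job.setNumReduceTasks(2);                     // two partitions -> two output files
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);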

    2. Change the String comparison from == to .equals(), i.e., change if(keyStr == "137176") { to if(keyStr.equals("137176")) {. The == operator compares object references rather than string contents, and since key.toString() returns a new String object, the test is never true and every key falls into the else branch.
      To save some time, it may be faster to declare a Text constant at the top of the Partitioner, like Text KEY = new Text("137176");, and compare the incoming key against it directly (again using the equals() method), instead of converting the input key to a String on every call. The two may well be equivalent in practice. So, what I suggest is:

      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Partitioner;

      public class IdPartitioner extends Partitioner<Text, Text> {
          private static final Text KEY = new Text("137176");

          @Override
          public int getPartition(Text key, Text value, int reducersDefined) {
              // Text.equals() compares byte contents, so no toString() is needed
              return key.equals(KEY) ? 0 : 1 % reducersDefined;
          }
      }
      

    Another suggestion: if the network load is heavy, emit the map output key as a VIntWritable instead of Text and change the Partitioner accordingly, as sketched below.
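
    A minimal sketch of that variant (assuming the Mapper is changed to emit VIntWritable keys; the class name is illustrative):

      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.io.VIntWritable;
      import org.apache.hadoop.mapreduce.Partitioner;

      public class VIntIdPartitioner extends Partitioner<VIntWritable, Text> {
          private static final VIntWritable KEY = new VIntWritable(137176);

          @Override
          public int getPartition(VIntWritable key, Text value, int reducersDefined) {
              // a variable-length int is usually smaller on the wire than the
              // textual id, which lightens the map -> reduce shuffle
              return key.equals(KEY) ? 0 : 1 % reducersDefined;
          }
      }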