javahadoopmapreducepartitioner

What if a custom partitioner is made to select different partitions for records having the same key?


While learning Hadoop MapReduce, I came across how to create a custom Partitioner class. I understand that we need to define the abstract getPartition method in our class. This method is supposed to return the Partition number (an integer) for the current key-value pair.

Now, the number of partitions will be equal to the number of reduce tasks for the job. What if in a custom partitioner, one writes some logic to select the partition based on the 'value' and not the 'key'? With my understanding, this could mean that records having the same key (but different values) might be processed by different reduce tasks, which is not what we are guaranteed by MapReduce. Isn't this an anomaly? And why do we even need the 'value' argument in getPartition(key, value, numPartitions) method? Please correct my understanding if incorrect.


Solution

  • Partitions can be made based on intermediate(output of mapper before spilling data to disk) key or value. When you partition based on value, two different partitions can have records having same keys.