hadoop, reducers, partitioner

Custom Partitioner, without setting number of reducers


Is it mandatory to set the number of reducers in order to use a custom partitioner? Example: in a Word Count job, I want the counts for all stop words to end up in one partition and the counts for all remaining words to go to a different partition. If I set the number of reducers to two, send stop words to one partition and everything else to the other, it will work, but then I am restricting the number of reducers to two (or N), which I don't want. What is the best approach here? Or do I have to calculate and set the number of reducers based on the size of the input to get the best performance?


Solution

  • Specifying a custom partitioner does not change anything since the number of partitions is provided to the partitioner:

    int getPartition(KEY key, VALUE value, int numPartitions) 
    

    If you don't set a partitioner then the HashPartitioner is used. Its implementation is trivial:

    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
    

    The design of a custom partitioner is up to you. The main goal of a partitioner is to avoid skew and to distribute the load evenly across the provided number of partitions. For a small job it may be acceptable to decide to support only two reducers, but if you want your job to scale then you must design it to run with an arbitrary number of reducers.

    Or do I have to calculate and set the number of reducers based on the size of the input to get the best performance?

    That is something you always have to do, and it is unrelated to the use of a custom partitioner. You MUST set the number of reducers yourself: the default value is 1, and Hadoop will not compute this value for you.
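    For example, here is a minimal driver sketch that sets the reducer count explicitly; the driver, mapper, and reducer class names are placeholders, not anything from the question:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);    // placeholder mapper
            job.setReducerClass(WordCountReducer.class);  // placeholder reducer
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // The default is 1; Hadoop will not derive this from the input size for you.
            job.setNumReduceTasks(4);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
    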

    If you want to send stop words to one reducer and all other words to the remaining reducers, you can do something like this:

    public int getPartition(K key, V value, int numReduceTasks) {
        // Assumes numReduceTasks >= 2, otherwise (numReduceTasks - 1) would be zero.
        if (isStopWord(key)) {
            return 0;
        } else {
            return ((key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1)) + 1;
        }
    }
    
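    To make that concrete, here is a sketch of a complete partitioner, assuming Text keys and IntWritable values as in a typical word count; the class name and stop-word list are illustrative only. It also guards against the single-reducer case mentioned above:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class StopWordPartitioner extends Partitioner<Text, IntWritable> {

        // Illustrative stop-word list; replace it with your own lookup.
        private static final Set<String> STOP_WORDS =
                new HashSet<>(Arrays.asList("a", "an", "the", "of", "and", "to"));

        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // With a single reducer there is only partition 0.
            if (numReduceTasks == 1 || STOP_WORDS.contains(key.toString())) {
                return 0;
            }
            // Spread every other word over the remaining reducers.
            return ((key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1)) + 1;
        }
    }
    

    The partitioner is then registered on the job with job.setPartitionerClass(StopWordPartitioner.class), next to the job.setNumReduceTasks(...) call shown earlier.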

    However, this approach can easily lead to a large data skew: the first reducer will be overloaded and will take much longer than the other reducers to complete. In that case it makes no sense to use more than two reducers.

    It could be an XY problem. I am not sure that what you are asking is the best way to solve your actual problem.