Tags: hadoop, mapreduce, partitioner

How to use the distributed cache in a Hadoop partitioner?


I am new to Hadoop and MapReduce partitioners. I want to write my own partitioner, and I need to read a file inside it. I have searched many times, and what I found says I should use the distributed cache. My question is: how can I use the distributed cache in my Hadoop partitioner? What should I write in my partitioner?

public static class CaderPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Placeholder: currently sends every record to reducer 0.
        return 0;
    }
}

Thanks


Solution

  • The easiest way to work this out is to look at the example Partitioners included with Hadoop. In this case the one to look at is the TotalOrderPartitioner, which reads in a pre-generated file to help direct keys.

    You can find the source code here, and here's a gist showing how to use it.

    First, in your MapReduce job's driver, you need to tell the partitioner where the file can be found (on HDFS):

    // Define partition file path.
    Path partitionPath = new Path(outputDir + "-part.lst");
    // Use Total Order Partitioner.
    job.setPartitionerClass(TotalOrderPartitioner.class);
    // Generate partition file from map-only job's output.
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionPath);
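
    If you'd rather ship the lookup file through the distributed cache than read it straight off HDFS, you can register it in the driver as well. A minimal sketch, assuming the file lives at /user/me/partition-data.txt on HDFS (that path, and the symlink name after '#', are just illustrations):

    // Register the file with the distributed cache. The '#partition-data'
    // fragment makes it appear in each task's working directory under that
    // symlink name. (new URI(...) throws URISyntaxException, which a driver
    // main typically declares with 'throws Exception'.)
    job.addCacheFile(new java.net.URI("/user/me/partition-data.txt#partition-data"));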
    

    In the TotalOrderPartitioner you'll see that it implements Configurable, which gives it access to the configuration so it can get the path to the file on HDFS.

    The file is read in the public void setConf(Configuration conf) method, which is called when the Partitioner object is created. At that point you can read the file and do whatever setup you want.

    I would think you can re-use a lot of the code from this partitioner; a rough sketch of how it could look for your case follows.
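
    Putting it together, here's a minimal sketch of a Configurable partitioner that loads its lookup table in setConf. The file name, the tab-separated "key<TAB>partition" line format, and the fallback-to-hashing logic are all assumptions for illustration; it relies on the cache-file registration shown above:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configurable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public static class CaderPartitioner extends Partitioner<Text, IntWritable>
            implements Configurable {

        private Configuration conf;
        private final Map<String, Integer> partitionMap = new HashMap<String, Integer>();

        @Override
        public void setConf(Configuration conf) {
            this.conf = conf;
            // Called once when the framework instantiates the partitioner.
            // "partition-data" is the symlink name registered with the cache
            // file in the driver above; each line is assumed to be
            // "key<TAB>partition".
            try (BufferedReader reader = new BufferedReader(new FileReader("partition-data"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t");
                    partitionMap.put(parts[0], Integer.parseInt(parts[1]));
                }
            } catch (IOException e) {
                throw new RuntimeException("Could not read partition file", e);
            }
        }

        @Override
        public Configuration getConf() {
            return conf;
        }

        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            Integer partition = partitionMap.get(key.toString());
            // Fall back to hash partitioning for keys the file doesn't mention.
            return partition != null
                    ? partition % numReduceTasks
                    : (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

    Each task reads the cached file once when the partitioner is instantiated, so the per-record getPartition call only does a map lookup.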