I have developed a Hadoop application that uses the distributed cache. I built it against Hadoop 2.9.0, and everything works fine in stand-alone and pseudo-distributed mode.
Driver:
public class MyApp extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: MyApp -files cache.txt <inputpath> <outputpath>");
            System.exit(-1);
        }
        int res = ToolRunner.run(new Configuration(), new MyApp(), args);
        System.exit(res);
    }
    ...
Mapper:
public class IDSMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void setup(Context context) throws IOException {
        // cache.txt is symlinked into the task's working directory by the -files option
        BufferedReader bfr = new BufferedReader(new FileReader(new File("cache.txt")));
I start the job with: sudo bin/hadoop jar MyApp.jar -files cache.txt /input /output
Now I need to measure execution time on a real Hadoop cluster. Unfortunately, the only cluster at my disposal runs Hadoop 1.2.1. So I created a new Eclipse project, referenced the appropriate Hadoop 1.2.1 jar files, and everything works fine in stand-alone mode. However, in pseudo-distributed mode Hadoop 1.2.1 fails with a FileNotFoundException in the mapper's setup method when it tries to read the distributed cache file.
Do I have to handle distributed cache files differently in Hadoop 1.2.1?
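For example, in some Hadoop 1.x examples the cached file is resolved through the old DistributedCache API instead of a plain FileReader. Is something like this required in 1.2.1? (Just a sketch of what I mean, not my actual code; it assumes imports from org.apache.hadoop.filecache and org.apache.hadoop.fs.)

@Override
protected void setup(Context context) throws IOException {
    // Ask the framework where the -files entries were localized on this node.
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (cached != null && cached.length > 0) {
        BufferedReader bfr = new BufferedReader(new FileReader(cached[0].toString()));
        // ... read lines as before ...
        bfr.close();
    }
}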
The problem was in the run method: I was creating the Job via Job.getInstance() with no parameters, while I should have created it like this:
Job job = Job.getInstance(getConf());
I still don't know why Hadoop 2.9.0 works with just:
Job job = Job.getInstance();
but getConf() solved my problem. As far as I can tell, ToolRunner's GenericOptionsParser records the -files argument in the Configuration it hands to the Tool, so a Job built from a fresh Configuration never learns about the file that should be shipped to the task nodes.
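For completeness, here is roughly how the run method can be wired up so that the Configuration populated by ToolRunner is reused. The job name, output key/value classes and the rest of the setup are illustrative placeholders rather than my exact code; it uses the standard org.apache.hadoop.mapreduce classes (Job, FileInputFormat, FileOutputFormat).

@Override
public int run(String[] args) throws Exception {
    // Reuse the Configuration that ToolRunner/GenericOptionsParser populated;
    // it already carries the file registered by the -files option.
    Job job = Job.getInstance(getConf());
    job.setJarByClass(MyApp.class);
    job.setJobName("MyApp");

    job.setMapperClass(IDSMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    return job.waitForCompletion(true) ? 0 : 1;
}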