Tags: java, hadoop, distributed-computing, distributed-cache

Read sharded output from Hadoop job from DistributedCache


(the title says "sharded" to reflect that Hadoop shards its output across multiple files)

I'm chaining multiple Hadoop jobs together. One of the early jobs generates output that is orders of magnitude smaller than the others, hence I'd like to put it into the DistributedCache. That's the hard part. Here's the code I wrote to do this:

// distCache is the Path to the output directory of the earlier job
FileSystem fs = FileSystem.get(conf);
// match every reducer shard (part-r-00000, part-r-00001, ...)
Path pathPattern = new Path(distCache, "part-r-[0-9]*");
FileStatus[] list = fs.globStatus(pathPattern);
for (FileStatus status : list) {
    DistributedCache.addCacheFile(status.getPath().toUri(), conf);
}

This works fine on my local machine and on a virtual cluster I set up. However, unlike in this question, it fails on AWS: DistributedCache.getCacheFiles() returns an empty list.

Essentially, I need a way to programmatically read the sharded output of one MR job and put it into the DistributedCache. I can't hard-code filenames, since the number of reducers can change every time the program is run. I don't fully grasp how S3 and HDFS work together, so I'm having a difficult time interacting with the FileSystem to read the sharded output. How can I do this in a way that works on AWS?
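
For reference, this is roughly how the cached shards get consumed later on. It's a trimmed-down sketch rather than my actual mapper (the class name, the in-memory map, and the tab-separated record format are placeholders), but DistributedCache.getLocalCacheFiles() is the standard Hadoop 1.0.x way to reach the task-local copies of whatever was added with addCacheFile():

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: class name, lookup map, and record format are placeholders.
public class CacheReadingMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
        // Task-local copies of every file added with addCacheFile() in the driver.
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cached == null || cached.length == 0) {
            throw new IOException("DistributedCache is empty");
        }
        for (Path shard : cached) {
            BufferedReader reader = new BufferedReader(new FileReader(shard.toString()));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2);  // TextOutputFormat default
                    lookup.put(parts[0], parts.length > 1 ? parts[1] : "");
                }
            } finally {
                reader.close();
            }
        }
    }

    // map() would then use the lookup table; omitted here.
}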

For reference, I'm using Hadoop 1.0.x: a combination of 1.0.4 (four Ubuntu 12.10 virtual machines) and 1.0.3 (AWS).


Solution

  • Turns out it was a simple fix to get things working on AWS:

    // resolve the FileSystem from the Path itself (e.g. s3n://) instead of the default one
    FileSystem fs = distCache.getFileSystem(conf);
    

    AWS could then see the shards under that directory, and it executed just fine. I still don't know why this was necessary for AWS to work when the previous code in my question functioned just fine on a standard cluster, but there you have it.
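
    Putting the pieces together, the driver-side code now looks roughly like this. It's a sketch under my setup's assumptions (CacheLoader, addShardsToCache, and outputDir are just names used here; outputDir plays the role of distCache from the question), but the Path.getFileSystem() plus globStatus() combination is what ended up working on both HDFS and S3:

    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical helper wrapping the working version of the snippet above.
    public final class CacheLoader {

        public static void addShardsToCache(Path outputDir, Configuration conf)
                throws IOException {
            // Derive the FileSystem from the Path (hdfs://, s3n://, ...) rather
            // than taking the cluster default via FileSystem.get(conf).
            FileSystem fs = outputDir.getFileSystem(conf);
            FileStatus[] shards = fs.globStatus(new Path(outputDir, "part-r-[0-9]*"));
            if (shards == null || shards.length == 0) {
                throw new IOException("No reducer output found under " + outputDir);
            }
            for (FileStatus shard : shards) {
                URI uri = shard.getPath().toUri();
                DistributedCache.addCacheFile(uri, conf);
            }
        }
    }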