I have read through the tutorials from Apache and Yahoo on DistributedCache. I am still confused about one thing though. Suppose I have a file which I want to be copied to all data nodes. So, I use
    DistributedCache.addCacheFile(new URI(hdfsPath), job)

in the job Driver to make the file available. Then I call

    DistributedCache.getLocalCacheFiles(job)

inside my Mapper.
Now, I want to create an array on the data node based on the contents of this file so that each time map() runs, it can access the elements of the array. Can I do this? I am confused because if I read the cached file and create the array within the Mapper class, it seems like it would create the array for each new input to the Mapper rather than just once per Mapper. How does this part actually work (i.e., where/when should I create the array)?
There are a few concepts mixed together here.

A datanode has nothing to do directly with the DistributedCache; the cache is a concept of the MapReduce layer.

The desire to reuse the same data derived from the cached file across mappers somewhat contradicts the functional nature of the MapReduce paradigm: mappers should be logically independent.

What you want is a kind of optimization, which makes sense if preprocessing the cached file for the mappers is relatively expensive.

You can do this to some extent by saving the preprocessed data in a static variable, lazily evaluating it, and configuring Hadoop to reuse JVMs between tasks. It is not a solution in the "MR spirit", but it should work.
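A minimal sketch of that approach, assuming the new-API Mapper, a plain-text cache file with one entry per line, and Hadoop 1.x; the class name LookupMapper and the key/value types are made up for illustration:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

        // Static, so the parsed array can survive between tasks that share a JVM
        // (only matters when JVM reuse is enabled); volatile for the double-check.
        private static volatile String[] lookup;

        // setup() runs once per task, so even without JVM reuse the file is
        // parsed once per mapper, not once per input record.
        @Override
        protected void setup(Context context) throws IOException {
            if (lookup == null) {
                synchronized (LookupMapper.class) {
                    if (lookup == null) {
                        Path[] cached = DistributedCache.getLocalCacheFiles(
                                context.getConfiguration());
                        List<String> entries = new ArrayList<String>();
                        BufferedReader in = new BufferedReader(
                                new FileReader(cached[0].toString()));
                        try {
                            String line;
                            while ((line = in.readLine()) != null) {
                                entries.add(line);
                            }
                        } finally {
                            in.close();
                        }
                        lookup = entries.toArray(new String[entries.size()]);
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Every call to map() can read from `lookup` without re-parsing the file.
        }
    }

With MRv1 JVM reuse enabled in the driver (e.g. job.getConfiguration().setInt("mapred.job.reuse.jvm.num.tasks", -1), a Hadoop 1.x setting), later tasks scheduled into the same JVM will find the static array already populated and skip the parsing entirely.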
A better solution is to preprocess the cached file into a form whose consumption by the mappers is cheap.

Let's assume the whole idea is an optimization; otherwise, reading and processing the file in each mapper is just fine.

Put another way: if preparing the data in each mapper is much cheaper than the map processing itself, or much cheaper than the overhead of running the mapper, we are fine.

By "form" I mean a file format that can be converted very efficiently into the in-memory structure we need. For example, if we need to search the data, we can store it already sorted. That saves us sorting each time, which is usually much more expensive than reading sequentially from disk.
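As a tiny self-contained illustration of that point (the entries here are invented): if the file is written out already sorted, loading it is a plain sequential read, and lookups become binary searches with no sort step at load time.

    import java.util.Arrays;

    public class SortedLookupDemo {
        public static void main(String[] args) {
            // Pretend these lines came from a cache file that was written out
            // pre-sorted, so loading needs no sort step.
            String[] sortedKeys = {"alpha", "beta", "delta", "gamma"};

            // Lookups are O(log n) binary searches on the already-sorted array.
            System.out.println(Arrays.binarySearch(sortedKeys, "delta") >= 0); // true
            System.out.println(Arrays.binarySearch(sortedKeys, "omega") >= 0); // false
        }
    }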
If in your case it is a modest number of properties (say, thousands), I would assume that reading and initializing them is not significant compared to the cost of a single mapper.