I was browsing the Hadoop website and found the following documentation for Hadoop Streaming:
https://hadoop.apache.org/docs/current1/streaming.html
However, I am more interested in the streaming command-line options for Hadoop YARN (MRv2).
If someone has an exhaustive list, could you please post it here?
If no such list exists, can somebody please tell me whether any of the command-line options in the following command are illegal?
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapred.job.name="Streaming wordCount Rating" \
-D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
-D map.output.key.field.separator=\t \
-D mapreduce.partition.keycomparator.options=-k2,2nr \
-D mapreduce.job.reduces=${NUM_REDUCERS} \
-files mapper2.py,reducer2.py \
-mapper "python mapper2.py" \
-reducer "python reducer2.py" \
-input ${OUT_DIR} \
-output ${OUT_DIR_2} > /dev/null
To see all of the Hadoop Streaming command-line options, refer to the setupOptions() method in StreamJob.java:
allOptions = new Options().
addOption(input).
addOption(output).
addOption(mapper).
addOption(combiner).
addOption(reducer).
addOption(file).
addOption(dfs).
addOption(additionalconfspec).
addOption(inputformat).
addOption(outputformat).
addOption(partitioner).
addOption(numReduceTasks).
addOption(inputreader).
addOption(mapDebug).
addOption(reduceDebug).
addOption(jobconf).
addOption(cmdenv).
addOption(cacheFile).
addOption(cacheArchive).
addOption(io).
addOption(background).
addOption(verbose).
addOption(info).
addOption(debug).
addOption(help).
addOption(lazyOutput);
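As a rough illustration of how this list can be used, here is a minimal Python sketch that sorts the flags of a streaming command into streaming options and generic Hadoop options. It assumes that each flag string matches the Java variable name shown in setupOptions(); the exact spelling and casing should be confirmed against StreamJob.java for your version. The function name classify_flags and the sample command are hypothetical. The generic options (-conf, -D, -fs, -jt, -files, -libjars, -archives) are handled by Hadoop's generic option parsing rather than by StreamJob, which is why -D and -files do not appear in the list above.

```python
import shlex

# Flag names assumed to match the variable names in setupOptions();
# verify them against StreamJob.java for your Hadoop version.
STREAMING_OPTIONS = {
    "input", "output", "mapper", "combiner", "reducer", "file", "dfs",
    "additionalconfspec", "inputformat", "outputformat", "partitioner",
    "numReduceTasks", "inputreader", "mapDebug", "reduceDebug", "jobconf",
    "cmdenv", "cacheFile", "cacheArchive", "io", "background", "verbose",
    "info", "debug", "help", "lazyOutput",
}

# Generic options shared by Hadoop commands (not specific to streaming).
GENERIC_OPTIONS = {"conf", "D", "fs", "jt", "files", "libjars", "archives"}


def classify_flags(command):
    """Return (streaming, generic, unknown) flag names found in a command string.

    Simplification: every token that starts with '-' is treated as a flag,
    so a value that itself begins with '-' would be misclassified.
    """
    streaming, generic, unknown = [], [], []
    for token in shlex.split(command):
        if len(token) < 2 or not token.startswith("-"):
            continue  # skip values, paths, and the jar name
        # "-Dkey=value" (no space) is also a generic -D option.
        name = "D" if token.startswith("-D") else token.lstrip("-")
        if name in GENERIC_OPTIONS:
            generic.append(name)
        elif name in STREAMING_OPTIONS:
            streaming.append(name)
        else:
            unknown.append(name)
    return streaming, generic, unknown


# Hypothetical, shortened command written on one logical line.
command = (
    'yarn jar hadoop-streaming.jar '
    '-D mapreduce.job.reduces=2 '
    '-files mapper2.py,reducer2.py '
    '-mapper "python mapper2.py" '
    '-reducer "python reducer2.py" '
    '-input in_dir -output out_dir'
)
streaming, generic, unknown = classify_flags(command)
print("streaming:", streaming)
print("generic:", generic)
print("unknown:", unknown)
```

Anything reported as unknown is worth checking by hand. Also note that the Hadoop Streaming documentation says the generic options must be placed before the streaming options, otherwise the command fails.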
The properties passed with -D are general MapReduce configuration options that apply to all MapReduce applications, not only to streaming. To check whether they are valid, look them up among the configuration variables in mapred-default.xml. Note that this refers to Hadoop 2.8.0, so you might need to find the appropriate XML for your version of Hadoop.
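If you would rather not look up each property by eye, the check can be sketched in a few lines of Python. This is only an illustration: SAMPLE_XML is a tiny stand-in for a real mapred-default.xml, the helper names are hypothetical, and mapreduce.job.reducs is a deliberately misspelled name used to show a miss.

```python
import xml.etree.ElementTree as ET

# Stand-in for mapred-default.xml; the real file is far longer and
# should be taken from the Hadoop version you actually run.
SAMPLE_XML = """<configuration>
  <property>
    <name>mapreduce.job.reduces</name>
    <value>1</value>
  </property>
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>100</value>
  </property>
</configuration>
"""


def documented_properties(xml_text):
    """Collect every <name> found under a <property> element."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name").strip()
        for prop in root.iter("property")
        if prop.findtext("name")
    }


def d_properties(args):
    """Extract property names from '-D name=value' pairs in an argument list."""
    names = []
    for i, arg in enumerate(args[:-1]):
        if arg == "-D":
            names.append(args[i + 1].split("=", 1)[0])
    return names


def check(args, xml_text):
    """Map each -D property name to True if it is listed in the XML."""
    known = documented_properties(xml_text)
    return {name: name in known for name in d_properties(args)}


# Hypothetical arguments; the second name is misspelled on purpose.
args = [
    "-D", "mapreduce.job.reduces=4",
    "-D", "mapreduce.job.reducs=4",
    "-input", "in_dir",
]
for name, listed in check(args, SAMPLE_XML).items():
    print(name, "listed" if listed else "NOT listed")
```

A property that is not listed is not necessarily illegal. As far as I know, the file only covers properties with documented defaults, and older names are mapped to their replacements through Hadoop's deprecated-properties mechanism, so treat a miss as a prompt to double-check the spelling rather than as proof of an error.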