To run a script using Hadoop Streaming, I use a bash script that looks like this:
IP1="/data/hdp/f1/part-*"
IP2="/data/hdp/f2/part-*"
OP="/data/hdp/op"
hadoop jar $HADOOP_JAR_PATH \
-file $MAPPER_FILE -mapper "$PY $MAPPER_FILE" \
-input $IP1 -input $IP2 \
-output $OP
How do I generalize this to a case where I have 20 input directories? One approach is to specify each one explicitly:
-input $IP1 -input $IP2 -input $IP3 ... -input $IP20
But I would like to know whether I can use shell variables and loops/arrays to get it done, like this:
IP_LIST=${!IP*}
IP_CMD=''
for ip in $IP_LIST
do
    IP_CMD=$IP_CMD"-input $"$ip" "
done
IP_ARRAY=($IP_CMD)
hadoop jar $HADOOP_JAR_PATH \
-file $MAPPER_FILE -mapper "$PY $MAPPER_FILE" \
"${IP_ARRAY[@]}" \
-output $OP
When I try this, I get an "Input path does not exist: hdfs://..." error.
The full command that I am using, as is:
IP1="/data/hdp/f1/part-*"
IP2="/data/hdp/f2/part-*"
OP="/data/hdp/op"
MAPPER_FILE="map_code.py"
REDUCER="reduce_code.py"
IP_LIST=${!IP*}
IP_CMD=''
for ip in $IP_LIST
do
    IP_CMD=$IP_CMD"-input $"$ip" "
done
hadoop fs -rm -r -skipTrash $OP
cmd="hadoop jar $HADOOP_JAR_PATH \
-D mapred.reduce.tasks=00 \
-Dmapreduce.output.fileoutputformat.compress=true \
-Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
-file $MAPPER_FILE \
-file $REDUCER \
-mapper \"$PY $MAPPER_FILE\" \
-reducer \"$PY $REDUCER\" \
-output $OP -cacheFile $DC#ref \
$IP_CMD"
eval $cmd
You could build the whole command as a single string and, once it is complete, run it with the eval command. In your example: append the rest of the command to IP_CMD and then run eval on the result.
This also explains the error from your array attempt: by the time "${IP_ARRAY[@]}" is passed to hadoop, parameter expansion has already finished, so hadoop receives the literal text $IP1 as an input path and reports "Input path does not exist". eval re-parses the assembled string, which is what expands $IP1 into the real directory.
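As a minimal sketch of that approach (reusing the variable names from your question, and assuming HADOOP_JAR_PATH, PY, and MAPPER_FILE are already set as in the original script), the loop below also switches to bash indirect expansion, ${!ip}, so each -input flag carries the variable's value rather than the literal string $IP1:

IP1="/data/hdp/f1/part-*"
IP2="/data/hdp/f2/part-*"
OP="/data/hdp/op"

# ${!IP*} expands to the names of all variables starting with "IP"
# (here: IP1 IP2), so capture it before IP_CMD is defined.
IP_LIST=${!IP*}

IP_CMD=''
for ip in $IP_LIST
do
    # ${!ip} is indirect expansion: the value of the variable named by $ip
    IP_CMD="$IP_CMD -input ${!ip}"
done

# Append the rest of the command to the string, then eval the whole thing.
# Assumes HADOOP_JAR_PATH, PY, and MAPPER_FILE are set as in the question.
cmd="hadoop jar $HADOOP_JAR_PATH \
-file $MAPPER_FILE -mapper \"$PY $MAPPER_FILE\" \
$IP_CMD \
-output $OP"
eval $cmd

With ${!ip} the input paths are already expanded when the string is built, so eval here mainly serves to honor the embedded quotes around the -mapper argument; the part-* globs are resolved against HDFS by the job itself.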