bash · shell · hadoop-streaming

How to pass multiple input directories to a hadoop command using a loop


To run a script using Hadoop Streaming, I use a bash script that looks like this:

IP1="/data/hdp/f1/part-*"
IP2="/data/hdp/f2/part-*"
OP="/data/hdp/op"
hadoop jar $HADOOP_JAR_PATH \
-file $MAPPER_FILE -mapper "$PY $MAPPER_FILE" \
-input $IP1 -input $IP2 \
-output $OP

How do I generalize this to the case where I have 20 input directories? One approach is to specify them as -input $IP1 -input $IP2 -input $IP3 ... -input $IP20

But I would like to know whether I can use shell variables and loops/arrays to get it done like this:

IP_LIST=${!IP*}
IP_CMD=''
for ip in $IP_LIST
do
    IP_CMD=$IP_CMD"-input $"$ip" "
done

IP_ARRAY=($IP_CMD)

hadoop jar $HADOOP_JAR_PATH \
-file $MAPPER_FILE -mapper "$PY $MAPPER_FILE" \
"${IP_ARRAY[@]}"
-output $OP

When I try this, I get an Input path does not exist: hdfs://... error.
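For debugging, printing what actually gets passed, e.g.

printf '%s\n' "${IP_ARRAY[@]}"

shows the array holds the literal strings -input and $IP1 rather than the expanded paths, so it looks like the $IP* values are never substituted.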

FULL COMMAND THAT I AM USING, AS-IS:

IP1="/data/hdp/f1/part-*"
IP2="/data/hdp/f2/part-*"
OP="/data/hdp/op"
MAPPER_FILE="map_code.py"
REDUCER="reduce_code.py"

IP_LIST=${!IP*}
IP_CMD=''
for ip in $IP_LIST
do
    IP_CMD=$IP_CMD"-input $"$ip" "
done

hadoop fs -rm -r -skipTrash $OP

cmd="hadoop jar $HADOOP_JAR_PATH \
-D mapred.reduce.tasks=00 \
-Dmapreduce.output.fileoutputformat.compress=true \
-Dmapreduce.output.fileoutputformat.compress.codec=\
org.apache.hadoop.io.compress.GzipCodec \
-file $MAPPER_FILE \
-file $REDUCER \
-mapper \"$PY $MAPPER_FILE\" \
-reducer \"$PY $REDUCER\" \
-output $OP -cacheFile $DC#ref \
$IP_CMD"
eval $cmd

Solution

  • You could build the whole command as a single string and, once it is finished, run it with the eval command.

    In your example: append the rest of the command to IP_CMD and then run eval $IP_CMD.
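For instance, a minimal sketch of that approach, reusing only the variable names from the question ($HADOOP_JAR_PATH, $PY, $MAPPER_FILE and the IP*/OP paths): the -input flags are accumulated with a literal $IP1, $IP2, ... in the string, and eval expands them when the command is finally parsed.

IP1="/data/hdp/f1/part-*"
IP2="/data/hdp/f2/part-*"
OP="/data/hdp/op"

# Capture the names of all IP* variables (IP1 IP2 ...) first, before
# defining any other variable whose name also starts with "IP".
IP_LIST=${!IP*}

IP_CMD="hadoop jar $HADOOP_JAR_PATH \
-file $MAPPER_FILE -mapper \"$PY $MAPPER_FILE\""
for ip in $IP_LIST
do
    # \$$ip leaves a literal $IP1, $IP2, ... in the string;
    # eval expands them when the command is run below.
    IP_CMD="$IP_CMD -input \$$ip"
done
IP_CMD="$IP_CMD -output $OP"

eval $IP_CMD

An alternative that avoids eval altogether is to build a bash array inside the same loop with indirect expansion, args+=(-input "${!ip}"), and then pass "${args[@]}" straight to hadoop; ${!ip} substitutes the value of the variable whose name is stored in ip, so no second round of parsing is needed.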