I'm trying to use an input format from Elephant Bird in my Hadoop Streaming script. In particular, I want to use the LzoInputFormat and eventually the LzoJsonInputFormat (working with Twitter data here). But when I try to do this, I keep getting an error that suggests that the Elephant Bird formats are not valid instances of the InputFormat class.
This is how I'm running the Streaming command:
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u5.jar \
-libjars /project/hanna/src/elephant-bird/build/elephant-bird-2.2.0.jar \
-D stream.map.output.field.separator=\t \
-D stream.num.map.output.key.fields=2 \
-D map.output.key.field.separator=\t \
-D mapred.text.key.partitioner.options=-k1,2 \
-file /home/a/ahanna/sandbox/hadoop-textual-analysis/streaming/filter/filterMap.py \
-file /home/a/ahanna/sandbox/hadoop-textual-analysis/streaming/filter/filterReduce.py \
-file /home/a/ahanna/sandbox/hadoop-textual-analysis/streaming/data/latinKeywords.txt \
-inputformat com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat \
-input /user/ahanna/lzotest \
-output /user/ahanna/output \
-mapper filterMap.py \
-reducer filterReduce.py \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
And this is the error that I get:
Exception in thread "main" java.lang.RuntimeException: class com.hadoop.mapreduce.LzoTextInputFormat not org.apache.hadoop.mapred.InputFormat
at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:1078)
at org.apache.hadoop.mapred.JobConf.setInputFormat(JobConf.java:633)
at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:707)
at org.apache.hadoop.streaming.StreamJob.run(StreamJob.java:122)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
For the sake of compatibility, Hadoop supports two ways of writing map/reduce tasks in Java: the "old" API, defined by the interfaces in the org.apache.hadoop.mapred package, and the "new" API, defined by the abstract classes in the org.apache.hadoop.mapreduce package.
You need to be aware of this even if you are using the streaming API, because streaming itself is written against the old API. So when you want to plug an external library into the streaming machinery, you have to make sure that library was also written the old-school way.
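To make the mismatch concrete, here is a small sketch (my own, not from the original post) that reproduces the assignability check Configuration.setClass() performs when JobConf.setInputFormat() is called. It assumes the Hadoop and Elephant Bird 2.2.0 jars are on the classpath:

// Sketch only: mirrors the check behind the "not org.apache.hadoop.mapred.InputFormat" error.
public class InputFormatCheck {
    public static void main(String[] args) {
        // Streaming calls JobConf.setInputFormat(), which requires the OLD interface.
        Class<?> oldApi = org.apache.hadoop.mapred.InputFormat.class;

        // The class passed with -inputformat lives in the NEW API (org.apache.hadoop.mapreduce),
        // so this prints false and the real job dies with the RuntimeException above.
        System.out.println(oldApi.isAssignableFrom(
                com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat.class));

        // Elephant Bird's "deprecated" variant implements the old interface, so this prints true.
        System.out.println(oldApi.isAssignableFrom(
                com.twitter.elephantbird.mapred.input.DeprecatedLzoTextInputFormat.class));
    }
}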
That is exactly what happened here. In the general case you would have to write a wrapper, but fortunately Elephant Bird already provides an old-style InputFormat, so all you need to do is replace com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat with com.twitter.elephantbird.mapred.input.DeprecatedLzoTextInputFormat.
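With that one substitution, the -inputformat line of your command becomes:

-inputformat com.twitter.elephantbird.mapred.input.DeprecatedLzoTextInputFormat \

and the rest of the invocation can stay exactly as it is.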