I'm trying to run Gobblin on Google Dataproc, but I'm getting this NoSuchMethodError and can't figure out how to solve it.
Waiting for job output...
...
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
...
Caused by: java.lang.NoSuchMethodError: org.apache.commons.cli.Option.builder()Lorg/apache/commons/cli/Option$Builder;
at gobblin.runtime.cli.CliOption
...
This same job (contents below) runs fine on my local Hadoop setup (on my laptop) but not on Dataproc. Has anyone ever tried running Gobblin on Dataproc?
Here's my gobblin job file:
job.name=kafka2gcs
job.group=gkafka2gcs
job.description=Gobblin job to read messages from Kafka and save as is on GCS
job.lock.enabled=false
kafka.brokers=mykafka:9092
topic.whitelist=mytopic
bootstrap.with.offset=earliest
source.class=gobblin.source.extractor.extract.kafka.KafkaDeserializerSource
kafka.deserializer.type=BYTE_ARRAY
extract.namespace=nskafka2gcs
writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
writer.destination.type=HDFS
mr.job.max.mappers=2
writer.output.format=txt
data.publisher.type=gobblin.publisher.BaseDataPublisher
metrics.enabled=false
fs.uri=file:///.
writer.fs.uri=${fs.uri}
mr.job.root.dir=gobblin
writer.output.dir=${mr.job.root.dir}/out
writer.staging.dir=${mr.job.root.dir}/stg
fs.gs.project.id=my-test-project
data.publisher.fs.uri=gs://my-bucket
state.store.fs.uri=${data.publisher.fs.uri}
data.publisher.final.dir=gobblin/pub
state.store.dir=gobblin/state
And these are the commands I issue for dataproc:
gcloud dataproc clusters create myspark \
--image-version 1.1 \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 10 \
--num-workers 2 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 10
gcloud dataproc jobs submit hadoop --cluster=myspark \
--class gobblin.runtime.mapreduce.CliMRJobLauncher \
--jars /opt/gobblin-dist/lib/gobblin-runtime-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-api-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-avro-json-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-codecs-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-core-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-core-base-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-crypto-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-crypto-provider-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-data-management-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-metastore-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-metrics-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-metrics-base-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-metadata-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-utility-0.10.0.jar,/opt/gobblin-dist/lib/avro-1.8.1.jar,/opt/gobblin-dist/lib/avro-mapred-1.8.1.jar,/opt/gobblin-dist/lib/commons-lang3-3.4.jar,/opt/gobblin-dist/lib/config-1.2.1.jar,/opt/gobblin-dist/lib/data-2.6.0.jar,/opt/gobblin-dist/lib/gson-2.6.2.jar,/opt/gobblin-dist/lib/guava-15.0.jar,/opt/gobblin-dist/lib/guava-retrying-2.0.0.jar,/opt/gobblin-dist/lib/joda-time-2.9.3.jar,/opt/gobblin-dist/lib/javassist-3.18.2-GA.jar,/opt/gobblin-dist/lib/kafka_2.11-0.8.2.2.jar,/opt/gobblin-dist/lib/kafka-clients-0.8.2.2.jar,/opt/gobblin-dist/lib/metrics-core-2.2.0.jar,/opt/gobblin-dist/lib/metrics-core-3.1.0.jar,/opt/gobblin-dist/lib/metrics-graphite-3.1.0.jar,/opt/gobblin-dist/lib/scala-library-2.11.8.jar,/opt/gobblin-dist/lib/influxdb-java-2.1.jar,/opt/gobblin-dist/lib/okhttp-2.4.0.jar,/opt/gobblin-dist/lib/okio-1.4.0.jar,/opt/gobblin-dist/lib/retrofit-1.9.0.jar,/opt/gobblin-dist/lib/reflections-0.9.10.jar \
--properties mapreduce.job.user.classpath.first=true \
-- -jobconfig gs://my-bucket/gobblin-kafka-gcs.job
I have already tried copying all of Gobblin's lib jars into /usr/lib/hadoop/lib on all machines of the Dataproc cluster, but that didn't work either.
Any ideas?
gobblin 0.10.0
hadoop 2.7.3
dataproc image 1.1
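The Hadoop distribution is probably leaking its version of commons-cli into your classpath, where it conflicts with the one Gobblin was compiled against. Gobblin appears to depend on commons-cli 1.3.1, while Hadoop 2.7.3 ships 1.2. Option.builder() was only added in commons-cli 1.3, so if the 1.2 jar is loaded first you get exactly this NoSuchMethodError.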
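To confirm which jar is actually winning, a small probe class can help. This is only a sketch: ClasspathProbe and its method names are made up for illustration and are not part of Gobblin or Hadoop. It prints where the JVM loaded a class from and whether that class exposes a given public method.

```java
import java.security.CodeSource;
import java.util.Arrays;

// Hypothetical diagnostic helper (not part of Gobblin or Hadoop).
public class ClasspathProbe {

    // Returns the jar or directory a class was loaded from,
    // or "bootstrap/unknown" when the JVM does not report one.
    public static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "bootstrap/unknown";
        }
        return src.getLocation().toString();
    }

    // True if the class has a public method with this name (any signature).
    public static boolean hasPublicMethod(Class<?> cls, String name) {
        return Arrays.stream(cls.getMethods())
                .anyMatch(m -> m.getName().equals(name));
    }

    public static void main(String[] args) throws Exception {
        // Defaults target the class and method from the stack trace above.
        String className = args.length > 0 ? args[0] : "org.apache.commons.cli.Option";
        String methodName = args.length > 1 ? args[1] : "builder";

        Class<?> cls = Class.forName(className);
        System.out.println(className + " loaded from: " + locationOf(cls));
        System.out.println("has " + methodName + "(): " + hasPublicMethod(cls, methodName));
    }
}
```

If you package this in a jar and submit it to the same cluster the same way as the Gobblin job, the first line of output should tell you whether Option comes from Hadoop's commons-cli-1.2.jar or from a 1.3.x jar.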
Typically, if a conflicting dependency like this comes from your own application, you'd use something like the Maven Shade plugin to relocate it. If you're building Gobblin from source, you could check whether it compiles against commons-cli 1.2 or whether the newer version is actually a hard dependency.
If commons-cli 1.3.1 is fully backwards compatible, you could try deleting /usr/lib/hadoop/lib/commons-cli-1.2.jar on your cluster and adding your own downloaded commons-cli-1.3.1.jar.