I'm new to Spark. I'm trying to run a Spark job that loads data into Elasticsearch. I built a fat jar from my code and passed it to spark-submit:
spark-submit \
--class CLASS_NAME \
--master yarn \
--deploy-mode cluster \
--num-executors 20 \
--executor-cores 5 \
--executor-memory 32G \
--jars EXTERNAL_JAR_FILES \
PATH_TO_FAT_JAR
The Maven dependency for elasticsearch-hadoop is:
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-hadoop</artifactId>
    <version>5.6.10</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>log4j-over-slf4j</artifactId>
        </exclusion>
    </exclusions>
</dependency>
When I don't include the elasticsearch-hadoop jar in the EXTERNAL_JAR_FILES list, I get this error:
java.lang.ExceptionInInitializerError
Caused by: java.lang.ClassNotFoundException: org.elasticsearch.spark.rdd.CompatUtils
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:344)
at org.elasticsearch.hadoop.util.ObjectUtils.loadClass(ObjectUtils.java:73)
... 26 more
If I include it in the EXTERNAL_JAR_FILES list, I get this error:
java.lang.Error: Multiple ES-Hadoop versions detected in the classpath; please use only one
jar:file:PATH_TO_CONTAINER/__app__.jar
jar:file:PATH_TO_CONTAINER/elasticsearch-hadoop-5.6.10.jar
at org.elasticsearch.hadoop.util.Version.<clinit>(Version.java:73)
at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:572)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:58)
at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:97)
at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:97)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Is there anything that needs to be done to overcome it?
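The problem was solved by keeping the elasticsearch-hadoop jar out of the fat jar I built. I did this by setting the scope to provided in the dependency. With provided scope, Maven compiles against the library but does not bundle it, so the only copy on the classpath is the one passed in the EXTERNAL_JAR_FILES list.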
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-hadoop</artifactId>
    <version>5.6.10</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>log4j-over-slf4j</artifactId>
        </exclusion>
    </exclusions>
    <scope>provided</scope>
</dependency>
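To confirm that a rebuilt fat jar no longer bundles ES-Hadoop, you can list its entries and look for the library's packages. Below is a minimal sketch in Python, not part of the original setup. The function name bundled_es_hadoop_entries is hypothetical, and the package prefixes are an assumption based on the class names in the stack traces above (org.elasticsearch.hadoop and org.elasticsearch.spark).

```python
import sys
import zipfile

# Assumed package prefixes, inferred from the class names in the stack traces.
ES_HADOOP_PREFIXES = ("org/elasticsearch/hadoop/", "org/elasticsearch/spark/")


def bundled_es_hadoop_entries(jar_path):
    """Return the jar entries that live under the assumed ES-Hadoop packages."""
    # A jar is a zip archive, so the standard zipfile module can list it.
    with zipfile.ZipFile(jar_path) as jar:
        return [name for name in jar.namelist()
                if name.startswith(ES_HADOOP_PREFIXES)]


if __name__ == "__main__":
    entries = bundled_es_hadoop_entries(sys.argv[1])
    if entries:
        print(f"{len(entries)} ES-Hadoop entries are still bundled, for example:")
        for name in entries[:5]:
            print("  " + name)
    else:
        print("No ES-Hadoop entries found in this jar.")
```

Saved under any file name and run with PATH_TO_FAT_JAR as its only argument, it should report no entries once the dependency is marked provided. If it still lists classes, the jar passed in EXTERNAL_JAR_FILES would collide with the bundled copy again.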