apache-spark amazon-s3 aws-sdk hadoop3

Problem accessing files from S3 using Spark on a remote YARN cluster


I'm trying to import CSV files from S3 using spark-shell (val df = spark.read.csv("s3a://xxxxxx")). The spark-shell client is connected to a remote YARN cluster. The read fails with a java.lang.VerifyError; however, when I launch spark-shell from the same machine as the YARN ResourceManager, it works fine.
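For reference, here is a slightly fuller version of what I run in spark-shell (the bucket name, path, and header option below are placeholders, not my real values):

// Run inside spark-shell; bucket and path are hypothetical placeholders
val df = spark.read
  .option("header", "true")              // assumption: the CSV files carry a header row
  .csv("s3a://my-bucket/incoming/*.csv")
df.printSchema()
df.show(5)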

Here is the error:

java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
  org/apache/hadoop/fs/s3a/S3AFileSystem.s3GetFileStatus(Lorg/apache/hadoop/fs/Path;Ljava/lang/String;Ljava/util/Set;)Lorg/apache/hadoop/fs/s3a/S3AFileStatus; @274: invokestatic
  Reason:
    Type 'com/amazonaws/AmazonServiceException' (current frame, stack[2]) is not assignable to 'com/amazonaws/SdkBaseException'
  Current Frame:
    bci: @274
    flags: { }
    locals: { 'org/apache/hadoop/fs/s3a/S3AFileSystem', 'org/apache/hadoop/fs/Path', 'java/lang/String', 'java/util/Set', 'java/lang/String', 'com/amazonaws/AmazonServiceException' }
    stack: { 'java/lang/String', 'java/lang/String', 'com/amazonaws/AmazonServiceException' }

spark-defaults.conf:

spark.master yarn
spark.hadoop.fs.s3a.server-side-encryption-algorithm SSE-KMS
spark.hadoop.fs.s3a.server-side-encryption.key xxxxxxxxxxxxxxxxxxxxxxxxxxx
spark.hadoop.fs.s3a.enableServerSideEncryption true
com.amazonaws.services.s3.enableV4 true
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.blockManager.port 20020
spark.driver.port 20020
spark.master.ui.port 4048
spark.ui.port 4041
spark.port.maxRetries 100
spark.yarn.jars hdfs://hdfs-master:4040/spark/jars/*
spark.driver.extraJavaOptions=-Dlog4j.configuration=/usr/local/spark/conf/log4j.properties
spark.executor.extraJavaOptions=-Dlog4j.configuration=/usr/local/spark/conf/log4j.properties
spark.eventLog.enabled  true
spark.eventLog.dir hdfs://hdfs-master:4040/spark-logs
spark.yarn.app.container.log.dir /home/aws_install/hadoop/logdir

.hadooprc

hadoop_add_to_classpath_tools hadoop-aws

Any idea what the root of the problem is?


Solution

  • This points to classpath problems.

    One problem with that .hadooprc change is that it only changes your local environment, not the environment on the rest of the cluster. But the fact that you got as far as org/apache/hadoop/fs/s3a/S3AFileSystem.s3GetFileStatus implies that the S3A JAR is being loaded, but the JVM itself is having problems verifying it.

    Possibly there are two copies of the AWS SDK on the classpath, and so the verifier is saying that the AmazonServiceException just raised isn't a subclass of SdkBaseException because of the mixed JARs.
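
    One way to check that from the running spark-shell is a rough sketch like the one below (assuming a live session with sc in scope; the class names are taken from the VerifyError above). It prints which JAR each class is loaded from, on the driver and inside an executor task; two different aws-java-sdk JARs or versions showing up here would confirm the mixed-JAR theory.

    // A rough sketch for spark-shell: report the source JAR of each class
    // named in the VerifyError, first on the driver, then on an executor.
    val classesToCheck = Seq(
      "com.amazonaws.SdkBaseException",
      "com.amazonaws.AmazonServiceException",
      "org.apache.hadoop.fs.s3a.S3AFileSystem")

    def sourceOf(name: String): String = try {
      Option(Class.forName(name).getProtectionDomain.getCodeSource)
        .map(_.getLocation.toString)
        .getOrElse("<bootstrap classpath>")
    } catch {
      case t: Throwable => s"failed to load (${t.getClass.getSimpleName})"
    }

    // Driver-side view
    classesToCheck.foreach(c => println(s"driver:   $c -> ${sourceOf(c)}"))

    // Executor-side view -- this is the part a local .hadooprc cannot change
    sc.parallelize(Seq(0), 1).map { _ =>
      classesToCheck.map { c =>
        val src = try {
          Option(Class.forName(c).getProtectionDomain.getCodeSource)
            .map(_.getLocation.toString)
            .getOrElse("<bootstrap classpath>")
        } catch {
          case t: Throwable => s"failed to load (${t.getClass.getSimpleName})"
        }
        s"executor: $c -> $src"
      }.mkString("\n")
    }.collect().foreach(println)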