apache-spark, amazon-s3

Apache Spark: java.lang.NoClassDefFoundError for software.amazon.awssdk.transfer.s3.progress.TransferListener when reading CSV from S3


I am trying to read a CSV file from S3 using Apache Spark, but I encounter the following error:

java.lang.NoClassDefFoundError: software/amazon/awssdk/transfer/s3/progress/TransferListener
  at java.base/java.lang.Class.forName0(Native Method)
  at java.base/java.lang.Class.forName(Class.java:398)
  at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2625)
  ...

These are the jars I'm using:

C:/spark/jars/iceberg-spark-runtime-3.5_2.12-1.7.1.jar,
C:/spark/jars/hadoop-aws-3.4.0.jar,
C:/spark/jars/aws-java-sdk-core-1.11.999.jar,
C:/spark/jars/aws-java-sdk-s3-1.11.999.jar,
C:/spark/jars/aws-sdk-core-2.17.99.jar,
C:/spark/jars/aws-sdk-s3-2.17.99.jar

Using only AWS SDK v1 (aws-java-sdk-core and aws-java-sdk-s3) causes authentication errors.

Using only AWS SDK v2 (aws-sdk-core and aws-sdk-s3) fails with the missing TransferListener class shown above.

When I combine the v1 and v2 jars in the spark-shell command, I still get the same NoClassDefFoundError.


Solution

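  • Set the spark.jars.packages configuration property in the application that reads from S3, rather than adding individual jars to the global Spark classpath by hand. Spark then resolves the listed Maven coordinates together with their transitive dependencies, so the AWS SDK classes that hadoop-aws expects are pulled in at a matching version.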
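
A minimal PySpark sketch of this approach, under some assumptions: the hadoop-aws and Iceberg versions are copied from the jar list in the question, the helper names (s3a_packages, build_session, read_csv_from_s3) and the bucket path are hypothetical, and AWS credentials are assumed to be available to the S3A connector already (for example through environment variables). The hadoop-aws version should generally match the Hadoop libraries your Spark build ships with, so adjust it if yours differ.

```python
# Sketch: let Spark resolve hadoop-aws and its transitive AWS SDK dependencies
# from Maven coordinates instead of copying individual jars into C:/spark/jars.

# Versions taken from the question's jar list; change them to match your setup.
HADOOP_AWS = "org.apache.hadoop:hadoop-aws:3.4.0"
ICEBERG_RUNTIME = "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1"


def s3a_packages(*coordinates: str) -> str:
    """Join Maven coordinates into a spark.jars.packages value (comma-separated)."""
    return ",".join(c.strip() for c in coordinates if c.strip())


def build_session(app_name: str = "read-csv-from-s3"):
    """Create a SparkSession with the packages configured (hypothetical helper)."""
    # Imported lazily so the helper above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    # spark.jars.packages has to be set before the session (and its JVM) starts.
    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars.packages", s3a_packages(HADOOP_AWS, ICEBERG_RUNTIME))
        .getOrCreate()
    )


def read_csv_from_s3(path: str = "s3a://my-bucket/path/data.csv"):
    """Read a CSV through the S3A connector; the default path is a placeholder."""
    spark = build_session()
    return spark.read.csv(path, header=True)
```

Calling read_csv_from_s3() with a real s3a:// path returns a DataFrame. When launching from the command line, the --packages flag of spark-shell or spark-submit takes the same comma-separated coordinates. Any manually copied AWS SDK jars left in the jars directory may still conflict with the resolved ones, so it is probably worth removing them first.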