[SOLVED] Apache Spark: java.lang.NoClassDefFoundError for software.amazon.awssdk.transfer.s3.progress.TransferListener when reading CSV from S3

I am trying to read a CSV file from S3 using Apache Spark, but I encounter the following error:

java.lang.NoClassDefFoundError: software/amazon/awssdk/transfer/s3/progress/TransferListener
  at java.base/java.lang.Class.forName0(Native Method)
  at java.base/java.lang.Class.forName(Class.java:398)
  at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2625)
  ...

These are the jars I'm using:

C:/spark/jars/iceberg-spark-runtime-3.5_2.12-1.7.1.jar,
C:/spark/jars/hadoop-aws-3.4.0.jar,
C:/spark/jars/aws-java-sdk-core-1.11.999.jar,
C:/spark/jars/aws-java-sdk-s3-1.11.999.jar,
C:/spark/jars/aws-sdk-core-2.17.99.jar,
C:/spark/jars/aws-sdk-s3-2.17.99.jar

Using only the AWS SDK v1 jars (aws-java-sdk-core and aws-java-sdk-s3) causes authentication errors.

Using only the AWS SDK v2 jars (aws-sdk-core and aws-sdk-s3) fails because the TransferListener class is missing.

Combining the v1 and v2 jars in the spark-shell command still produces the same NoClassDefFoundError.

Solution

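Set the spark.jars.packages configuration property in the application that reads from S3, instead of copying individual jars into the global Spark classpath (C:/spark/jars). Spark then resolves hadoop-aws together with its transitive dependencies from Maven, so you do not have to assemble the AWS SDK jars by hand. hadoop-aws 3.4.0 is built against AWS SDK v2, and TransferListener belongs to the SDK's S3 transfer manager module, which is not part of the core and s3 jars.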
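
A minimal PySpark sketch of this approach follows. It assumes the versions from the question (hadoop-aws 3.4.0, Iceberg runtime 1.7.1 for Spark 3.5 / Scala 2.12) and that hadoop-aws brings in the matching AWS SDK v2 bundle as a transitive dependency. The helper names (spark_packages, build_session, read_csv_from_s3) and the app name are hypothetical and only serve the illustration.

```python
# Maven coordinates taken from the jar names in the question.
# Assumption: the hadoop-aws version should match the Hadoop version
# your Spark build uses; adjust it if your installation differs.
HADOOP_AWS = "org.apache.hadoop:hadoop-aws:3.4.0"
ICEBERG_RUNTIME = "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1"


def spark_packages() -> str:
    """Return the comma-separated value for spark.jars.packages."""
    return ",".join([HADOOP_AWS, ICEBERG_RUNTIME])


def build_session(app_name: str = "s3-csv-read"):
    """Create a SparkSession that resolves its S3 dependencies from Maven."""
    # Imported lazily so spark_packages() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        # Must be set before the session (and its JVM) is first created.
        .config("spark.jars.packages", spark_packages())
        .getOrCreate()
    )


def read_csv_from_s3(path: str):
    """Read a CSV file through the s3a:// connector, treating row 1 as a header."""
    spark = build_session()
    return spark.read.csv(path, header=True)
```

Calling read_csv_from_s3 with your own s3a:// path then returns a DataFrame. With this setup the hand-copied AWS SDK v1 and v2 jars should no longer be needed in C:/spark/jars; leaving them there risks loading conflicting class versions.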