amazon-web-services, pyspark, aws-sdk

AWS DefaultCredentialsProvider for fs.s3a.aws.credentials.provider in the aws-sdk version 2


In PySpark, I have used this code many times:

spark_session = (SparkSession
                  .builder
                  .appName(f"ditto-lander-spark-ingest-etl-{args['etl_job_name']}")
                  .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                          "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
                  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
                  .getOrCreate())

But I see this warning:

"The AWS SDK for Java 1.x entered maintenance mode starting July 31, 2024 and will reach end of support on December 31, 2025"

Looking at the documentation, it seems I should be using the aws-sdk-java v2:

.config("spark.hadoop.fs.s3a.aws.credentials.provider",
        "software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider")

But I get this error:

: java.io.IOException: Class class software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider does not implement AWSCredentialsProvider

This is aws-sdk-java 2.29.25. Essentially, I would like to use a credentials provider chain with aws-sdk-java v2 for fs.s3a in PySpark.


Solution

  • This is fixed in Hadoop release 3.4.x, so you will need to plan an upgrade to support this. From the release notes:

    This release upgrades Hadoop’s AWS connector S3A from AWS SDK for Java V1 to AWS SDK for Java V2. This is a significant change which offers a number of new features, including the ability to work with Amazon S3 Express One Zone storage - the new high-performance, single-AZ storage class.
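
    Once the S3A connector on your classpath is from Hadoop 3.4.x, the v2 provider class from the question should be accepted. Below is a minimal sketch, assuming a Spark distribution whose Hadoop client libraries are 3.4.x and a matching hadoop-aws jar (which pulls in the AWS SDK for Java v2 bundle); the version 3.4.1 and the app name are illustrative, not taken from the question.

    from pyspark.sql import SparkSession

    # Sketch only: requires that the Hadoop libraries bundled with Spark are
    # also 3.4.x, and that the hadoop-aws version matches them exactly.
    spark_session = (
        SparkSession.builder
        .appName("s3a-sdk-v2-example")  # placeholder app name
        # Pull in the Hadoop 3.4.x S3A connector; with spark-submit you could
        # instead pass --packages org.apache.hadoop:hadoop-aws:3.4.1
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.4.1")
        # With the v2-based S3A connector, the AWS SDK v2 credentials
        # provider chain can be referenced directly.
        .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                "software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider")
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .getOrCreate()
    )

    Note that mixing a 3.4.x hadoop-aws jar with an older Hadoop runtime (e.g. the 3.3.x shipped with Spark 3.5) tends to fail with class-compatibility errors, which is why the answer suggests planning the Hadoop upgrade first.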