hadoop, amazon-s3, distcp, s3distcp

Hadoop distcp No AWS Credentials provided


I have a huge S3 bucket of files that I want to put on HDFS. Given the number of files involved, my preferred solution is to use 'distributed copy'. However, for some reason I can't get hadoop distcp to take my Amazon S3 credentials. The command I use is:

hadoop distcp -update s3a://[bucket]/[folder]/[filename] hdfs:///some/path/ -D fs.s3a.awsAccessKeyId=[keyid] -D fs.s3a.awsSecretAccessKey=[secretkey] -D fs.s3a.fast.upload=true

However, it behaves as if the '-D' arguments weren't there at all:

ERROR tools.DistCp: Exception encountered
java.io.InterruptedIOException: doesBucketExist on [bucket]: com.amazonaws.AmazonClientException: No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.SdkClientException: Unable to load credentials from service endpoint

I've looked at the hadoop distcp documentation, but can't find an explanation there for why this isn't working. I've tried -Dfs.s3n.awsAccessKeyId as a flag, which didn't work either. I've read that explicitly passing credentials isn't good practice, so maybe this is just a gentle suggestion to do it some other way?

How is one supposed to pass S3 credentials to distcp? Does anyone know?


Solution

  • It appears the credential property names have changed since earlier versions: the s3a connector expects fs.s3a.access.key and fs.s3a.secret.key rather than fs.s3a.awsAccessKeyId. In addition, the -D generic options must come before the other distcp arguments, not after the source and destination paths. The following command works:

    hadoop distcp \
      -Dfs.s3a.access.key=[accesskey] \
      -Dfs.s3a.secret.key=[secretkey] \
      -Dfs.s3a.fast.upload=true \
      -update \
      s3a://[bucket]/[folder]/[filename] hdfs:///some/path
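
    If you would rather keep the keys off the command line entirely (the concern raised in the question), the same two s3a properties can be set in core-site.xml under the client's Hadoop configuration directory instead of being passed with -D. A minimal sketch with placeholder values:

      <configuration>
        <!-- Same properties as the -D flags above, kept out of shell history -->
        <property>
          <name>fs.s3a.access.key</name>
          <value>[accesskey]</value>
        </property>
        <property>
          <name>fs.s3a.secret.key</name>
          <value>[secretkey]</value>
        </property>
      </configuration>

    With the keys in core-site.xml, the -D credential flags can be dropped from the distcp command entirely.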