I am trying to write a Spark Dataset to S3 using hadoop-aws, but I keep getting an AWSBadRequestException. Does anyone have an idea of what is going wrong?
Versions:
hadoop-aws: 3.3.1
spark: 3.2.1
Here is my code:
SparkConf sparkConf = new SparkConf();
sparkConf.setMaster("local[*]")
.set("spark.hadoop.fs.s3a.access.key", "my access key")
.set("spark.hadoop.fs.s3a.secret.key", "my secret key")
.set("spark.hadoop.fs.s3a.endpoint", "s3.us-west-2.amazonaws.com")
.set(CassandraOptions.SPARK_CONF_HOST, getHost(env))
.set(CassandraOptions.SPARK_CONF_PORT, getPort(env))
.set(CassandraOptions.SPARK_CONF_LOCALDC, getDatacenter(env))
.set(CassandraOptions.TABLE, table)
.set(CassandraOptions.KEYSPACE, keyspace);
this.sparkSession = SparkSession.builder().config(sparkConf).getOrCreate();
// getting Dataset<Row> dataset using sparkSession
// Dataset<Row> ds = dfr.load().where(filterCondition);
// ds.show();
String s3Path = "s3a://" + bucket_name + "/" + path;
ds.write().parquet(s3Path);
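One way to rule out credentials being silently overridden by a local AWS profile or environment variables is to pin S3A to the simple key/secret provider, so it only uses the keys set in SparkConf. This is just a sketch of an extra setting to try, using the standard provider class shipped with hadoop-aws:

```java
// Sketch: force S3A to use only the access/secret key from SparkConf,
// instead of also consulting environment variables and ~/.aws profiles.
sparkConf.set("spark.hadoop.fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider");
```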
Here is the error stack trace:
Exception in thread "main" org.apache.hadoop.fs.s3a.AWSBadRequestException: getFileStatus on s3a://my-test-bucket/data/configuration-migration/cassandra/Oregon/shared_services/orgid_appurl/00Dx0000000JBysEAG: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: SAJPPB0RXB43CJ8Y; S3 Extended Request ID: vMUJ5utvguWgzUmBVGw80FAOPP0OBU9a5QFjRDEbJtaHMSl7qZm4+LZzloflAyzSh3Z6maEX6n8=; Proxy: null), S3 Extended Request ID: vMUJ5utvguWgzUmBVGw80FAOPP0OBU9a5QFjRDEbJtaHMSl7qZm4+LZzloflAyzSh3Z6maEX6n8=:400 Bad Request: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: SAJPPB0RXB43CJ8Y; S3 Extended Request ID: vMUJ5utvguWgzUmBVGw80FAOPP0OBU9a5QFjRDEbJtaHMSl7qZm4+LZzloflAyzSh3Z6maEX6n8=; Proxy: null)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:243)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170)
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3286)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:3053)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1760)
at org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:4263)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:117)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:128)
It was an AWS profile conflict: the credentials picked up from my local environment clashed with the keys set in SparkConf. Setting up separate AWS profiles resolved the issue.
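For anyone hitting the same thing, the shape of the fix was roughly the following (profile names are placeholders, keys elided); split the conflicting credentials into named profiles in `~/.aws/credentials` and make sure the Spark job runs under the intended one (e.g. via the `AWS_PROFILE` environment variable):

```ini
# ~/.aws/credentials -- separate the conflicting credentials into named profiles
[default]
aws_access_key_id = ...
aws_secret_access_key = ...

[spark-s3]
aws_access_key_id = ...
aws_secret_access_key = ...
```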