java, apache-spark, amazon-s3, spark-cassandra-connector

AWSBadRequestException when trying to write a Cassandra dataset to an S3 bucket using hadoop-aws with Spark Java


I am trying to write a Spark dataset to S3 using hadoop-aws, but I keep getting an AWSBadRequestException. Does anyone have an idea of what is going wrong?

Versions:

  • hadoop-aws: 3.3.1
  • spark: 3.2.1

Here is my code:

        SparkConf sparkConf = new SparkConf();

        sparkConf.setMaster("local[*]")
                .set("spark.hadoop.fs.s3a.access.key", "my access key")
                .set("spark.hadoop.fs.s3a.secret.key", "my secret key")
                .set("spark.hadoop.fs.s3a.endpoint", "s3.us-west-2.amazonaws.com")
                .set(CassandraOptions.SPARK_CONF_HOST, getHost(env))
                .set(CassandraOptions.SPARK_CONF_PORT, getPort(env))
                .set(CassandraOptions.SPARK_CONF_LOCALDC, getDatacenter(env))
                .set(CassandraOptions.TABLE, table)
                .set(CassandraOptions.KEYSPACE, keyspace);
        this.sparkSession = SparkSession.builder().config(sparkConf).getOrCreate();

        // Get a Dataset<Row> using sparkSession
        Dataset<Row> ds = dfr.load().where(filterCondition);
        ds.show();

        String s3Path = "s3a://" + bucket_name + "/" + path;
        ds.write().parquet(s3Path);

The following is the error stack trace:

Exception in thread "main" org.apache.hadoop.fs.s3a.AWSBadRequestException: getFileStatus on s3a://my-test-bucket/data/configuration-migration/cassandra/Oregon/shared_services/orgid_appurl/00Dx0000000JBysEAG: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: SAJPPB0RXB43CJ8Y; S3 Extended Request ID: vMUJ5utvguWgzUmBVGw80FAOPP0OBU9a5QFjRDEbJtaHMSl7qZm4+LZzloflAyzSh3Z6maEX6n8=; Proxy: null), S3 Extended Request ID: vMUJ5utvguWgzUmBVGw80FAOPP0OBU9a5QFjRDEbJtaHMSl7qZm4+LZzloflAyzSh3Z6maEX6n8=:400 Bad Request: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: SAJPPB0RXB43CJ8Y; S3 Extended Request ID: vMUJ5utvguWgzUmBVGw80FAOPP0OBU9a5QFjRDEbJtaHMSl7qZm4+LZzloflAyzSh3Z6maEX6n8=; Proxy: null)
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:243)
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3286)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:3053)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1760)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:4263)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:117)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
    at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
    at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:128)

Solution

  • It was an AWS profile conflict. Setting up separate AWS profiles resolved the issue.
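
For reference, separate profiles can be kept in `~/.aws/credentials` so the keys used by the Spark job do not collide with another account's defaults. This is a minimal sketch; the profile names and key values here are illustrative, not from the original post:

```ini
# ~/.aws/credentials — one named profile per account (names are illustrative)
[default]
aws_access_key_id     = AKIA...DEFAULT
aws_secret_access_key = ...

[spark-s3-writer]
aws_access_key_id     = AKIA...SPARK
aws_secret_access_key = ...
```

A profile can then be selected by exporting `AWS_PROFILE=spark-s3-writer` before launching the job. Note that whether the S3A connector reads the profile file depends on which credential providers are configured via `fs.s3a.aws.credentials.provider`; when `fs.s3a.access.key`/`fs.s3a.secret.key` are set directly in the SparkConf, as in the question, those values take precedence, so the conflict is typically with credentials picked up elsewhere in the chain (environment variables or SDK defaults).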