scala apache-spark amazon-s3 apache-spark-sql spark-structured-streaming

Spark path style access with fs.s3a.path.style.access property is not working


I am trying to write to an on-prem S3 bucket over s3a, so the path in my Spark writeStream() call is s3a://test-bucket/. To make sure Spark can handle this, I added hadoop-aws-2.7.4.jar and aws-java-sdk-1.7.4.jar to build.sbt and configured Hadoop in code as follows -

    // Point s3a at the on-prem endpoint and enable path-style access
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", ENDPOINT)
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
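
For context, the dependencies in build.sbt look roughly like this (a sketch: the Spark artifact and version are illustrative assumptions; only the hadoop-aws and aws-java-sdk versions are the ones mentioned above) -

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"    % "2.4.5",  // illustrative Spark version
      "org.apache.hadoop" % "hadoop-aws"   % "2.7.4",  // provides org.apache.hadoop.fs.s3a.S3AFileSystem
      "com.amazonaws"     % "aws-java-sdk" % "1.7.4"   // AWS SDK matching hadoop-aws 2.7.x
    )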

And now I try to write data into my custom s3 endpoint as follows -

    val dataStreamWriter: DataStreamWriter[Row] = PM25quality.select(
        dayofmonth(current_date()) as "day",
        month(current_date()) as "month",
        year(current_date()) as "year",
        column("time"),
        column("quality"),
        column("PM25"))
      .writeStream
      .partitionBy("year", "month", "day")
      .format("csv")
      .outputMode("append")
      .option("path", "s3a://test-bucket/")

    val streamingQuery: StreamingQuery = dataStreamWriter.start()

But it seems that enabling path-style access has no effect: the client is still using virtual-hosted-style addressing, prepending the bucket name to the endpoint host instead of appending it to the path -

    20/05/01 15:39:02 INFO AmazonHttpClient: Unable to execute HTTP request: test-bucket.s3-region0.cloudian.com
    java.net.UnknownHostException: test-bucket.s3-region0.cloudian.com

Can someone let me know if I am missing anything here?


Solution

  • I tracked down the issue, thanks to mazaneicha's comment: the fix is to bump the hadoop-aws jar version to 2.8.0 in build.sbt. It seems the fs.s3a.path.style.access flag was only introduced in Hadoop 2.8.0 (see the JIRA ticket HADOOP-12963), so hadoop-aws-2.7.4 simply ignores it. After upgrading, it worked.
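
Concretely, the dependency bump looks something like this (a sketch; the Spark line is illustrative, and hadoop-aws 2.8.0 should pull in a compatible AWS SDK transitively, so the old aws-java-sdk-1.7.4 pin can be dropped) -

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"  % "2.4.5",  // illustrative Spark version
      // Hadoop 2.8.0 is the first release that honours fs.s3a.path.style.access (HADOOP-12963)
      "org.apache.hadoop" % "hadoop-aws" % "2.8.0"
    )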