Tags: java, apache-spark, amazon-s3, spark-structured-streaming, checkpointing

S3 Checkpoint with Structured Streaming


I have tried the suggestions given in Apache Spark (Structured Streaming): S3 Checkpoint support.

I am still facing this issue. Below is the error I get:

17/07/06 17:04:56 WARN FileSystem: "s3n" is a deprecated filesystem 
name. Use "hdfs://s3n/" instead.
Exception in thread "main" java.lang.IllegalArgumentException: 
java.net.UnknownHostException: s3n
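
If it helps, I dug into how the URI is parsed. Because `fs.defaultFS` is set to a bare name like `s3` or `s3n` instead of a full URI, Hadoop seems to fall back to treating that value as a hostname, which would explain the `UnknownHostException`. Here is a small standalone sketch of the parsing (`my-bucket` is just a placeholder):

```java
import java.net.URI;

public class CheckpointUriCheck {
    public static void main(String[] args) {
        // A full checkpoint URI: the authority part is the bucket name.
        URI full = URI.create("s3n://my-bucket/checkpointLocation/");
        System.out.println(full.getScheme()); // prints "s3n"
        System.out.println(full.getHost());   // prints "my-bucket"

        // A bare value like "s3n" has no scheme or authority at all --
        // it parses as a relative path, so there is no way to tell it is
        // meant to be a filesystem scheme rather than a hostname.
        URI bare = URI.create("s3n");
        System.out.println(bare.getScheme()); // prints "null"
        System.out.println(bare.getHost());   // prints "null"
    }
}
```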

I have something like this as part of my code:

SparkSession spark = SparkSession
    .builder()
    .master("local[*]")
    .config("spark.hadoop.fs.defaultFS","s3")
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    .config("spark.hadoop.fs.s3n.awsAccessKeyId","<my-key>")
    .config("spark.hadoop.fs.s3n.awsSecretAccessKey","<my-secret-key>")
    .appName("My Spark App")
    .getOrCreate();

and then the checkpoint directory is used like this:

StreamingQuery line = topicValue.writeStream()
   .option("checkpointLocation","s3n://<my-bucket>/checkpointLocation/")

Any help is appreciated. Thanks in advance!


Solution

  • For S3 checkpointing support in Structured Streaming, you can try the following approach:

    SparkSession spark = SparkSession
        .builder()
        .master("local[*]")
        .appName("My Spark App")
        .getOrCreate();

    spark.sparkContext().hadoopConfiguration().set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
    spark.sparkContext().hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "<my-key>");
    spark.sparkContext().hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "<my-secret-key>");
    

    and then the checkpoint location can be set like this:

    StreamingQuery line = topicValue.writeStream()
       .option("checkpointLocation", "s3n://<my-bucket>/checkpointLocation/")
       .start();
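
As a side note, the s3n connector is deprecated in newer Hadoop versions. If you have the hadoop-aws module (and its matching AWS SDK dependency) on the classpath, the equivalent setup with the newer s3a connector would look roughly like this; the property names below are the standard Hadoop s3a keys:

```java
// Assumes hadoop-aws and the matching aws-java-sdk jar are on the classpath.
spark.sparkContext().hadoopConfiguration().set("fs.s3a.access.key", "<my-key>");
spark.sparkContext().hadoopConfiguration().set("fs.s3a.secret.key", "<my-secret-key>");

StreamingQuery line = topicValue.writeStream()
   .option("checkpointLocation", "s3a://<my-bucket>/checkpointLocation/")
   .start();
```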
    

    I hope this helps!