java, scala, apache-spark, rdd, hortonworks-data-platform

Spark read file from S3 using sc.textFile("s3n://...")


Trying to read a file located in S3 using spark-shell:

scala> val myRdd = sc.textFile("s3n://myBucket/myFile1.log")
myRdd: org.apache.spark.rdd.RDD[String] = s3n://myBucket/myFile1.log MappedRDD[55] at textFile at <console>:12

scala> myRdd.count
java.io.IOException: No FileSystem for scheme: s3n
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2607)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2614)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    ... etc ...

What is the cause of the IOException: No FileSystem for scheme: s3n error? A missing dependency, a missing configuration, or misuse of sc.textFile()?

Or maybe this is due to a bug specific to the Spark build for Hadoop 2.60, as this post seems to suggest. I am going to try the Spark build for Hadoop 2.40 to see if that solves the issue.


Solution

  • Confirmed that this is related to the Spark build against Hadoop 2.60. I just installed Spark 1.4.0 "Pre-built for Hadoop 2.4 and later" (instead of the Hadoop 2.6 build), and the No FileSystem for scheme: s3n error is gone.

    sc.textFile("s3n://bucketname/Filename") now raises another error:

    java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
    

    The code below uses the s3n URL format with the credentials inline to show that Spark can read an S3 file. It was run on a dev machine (no Hadoop libs installed).

    scala> val lyrics = sc.textFile("s3n://MyAccessKeyID:MySecretKey@zpub01/SafeAndSound_Lyrics.txt")
    lyrics: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:21
    
    scala> lyrics.count
    res1: Long = 9
    

    Even better: the code above, with the AWS credentials inline in the s3n URI, will break if the AWS Secret Key contains a forward slash "/". Configuring the AWS credentials on the SparkContext's Hadoop configuration fixes this, and the code works whether the S3 file is public or private. (A sketch using the newer s3a connector follows after the code below.)

    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "BLABLA")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "....") // can contain "/"
    val myRDD = sc.textFile("s3n://myBucket/MyFilePattern")
    myRDD.count
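
    A possible alternative (untested on the exact setup above): in Hadoop 2.6+ the s3n/s3a filesystem classes live in the separate hadoop-aws module, so another option is to keep the Hadoop 2.6 build, launch with spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3 (the version here is illustrative; match it to your Hadoop build), and use the newer s3a connector with the fs.s3a.* keys:

    // assumes spark-shell was launched with the hadoop-aws module on the classpath
    // fs.s3a.access.key / fs.s3a.secret.key are the standard Hadoop keys for the s3a connector
    sc.hadoopConfiguration.set("fs.s3a.access.key", "BLABLA")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "....") // may also contain "/"
    val myRDD = sc.textFile("s3a://myBucket/MyFilePattern")
    myRDD.count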