excel, scala, apache-spark, spark-excel

S3 path printed incorrectly by the spark-excel reader


I am trying to read an Excel sheet from Amazon S3 with the code snippet below, but it fails saying the file doesn't exist even though it is there. When I checked the error, a slash (/) is missing from the path.

println(path)
val data = sqlContext.read.
    format("com.crealytics.spark.excel").
    option("location", path).
    option("useHeader", "true").
    option("treatEmptyValuesAsNulls", "true").
    option("inferSchema", "true").
    option("addColorColumns", "true").
    load(path)

path is printed correctly as: s3a://AKIAJDDDDDDACNA:A6voquDDDDDqNOUsONDy@my-test/test.xlsx

But why is the slash missing when Spark reads the path? Here is the error message:

 Name: java.io.FileNotFoundException
    Message: s3a:/AKIAJYDDDDDDNA:A6DDDDDDDDDwqxkRqUQyXqqNOUsONDy@my-test/test.xlsx (No such file or directory)
    StackTrace:   at java.io.FileInputStream.open0(Native Method)
      at java.io.FileInputStream.open(FileInputStream.java:212)
      at java.io.FileInputStream.<init>(FileInputStream.java:152)
      at java.io.FileInputStream.<init>(FileInputStream.java:104)
      at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:28)
      at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:31)
      at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:7)
      at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:345)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
      at $anonfun$1.apply(<console>:46)
      at $anonfun$1.apply(<console>:46)
      at time(<console>:36)

Solution

  • Somehow the s3a URL is getting down to java.io.FileInputStream.open(), which only works with local filesystem files, not HDFS, S3, etc. You will need to track down what is happening there inside com.crealytics.spark.excel. Welcome to the world of using IDEs to work out what third-party libraries get up to :) (IntelliJ IDEA is very good at that, by the way, as it can go from a pasted stack trace to the specific source code.)

    Also: don't put your secrets in your URLs; that's dangerous, and it is something that may get disabled in the future for security reasons. Set spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key in your spark-defaults.conf instead.
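    For example, a spark-defaults.conf sketch (the key names are the standard s3a properties; the values are placeholders, not real credentials):

    ```
    spark.hadoop.fs.s3a.access.key   YOUR_ACCESS_KEY
    spark.hadoop.fs.s3a.secret.key   YOUR_SECRET_KEY
    ```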
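    The missing slash itself can be reproduced without Spark. A minimal sketch, assuming the library hands the URL string to java.io.File (which is what FileInputStream does internally): the local-filesystem path normalization on Unix-like systems collapses repeated '/' characters, which is exactly how "s3a://..." becomes "s3a:/...":

    ```scala
    // Hedged sketch: plain JVM behaviour, no Spark required.
    // java.io.File treats its argument as a local path and normalizes it,
    // collapsing the "//" after the URL scheme into a single "/".
    object SlashDemo {
      def main(args: Array[String]): Unit = {
        val url = "s3a://my-test/test.xlsx"
        val normalized = new java.io.File(url).getPath
        println(normalized) // on Unix-like systems: s3a:/my-test/test.xlsx
      }
    }
    ```

    This matches the path shown in the FileNotFoundException, and is a strong hint that the library is opening the location as a local file rather than going through the Hadoop FileSystem API.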