I am trying to read an Excel sheet from Amazon S3 with the code snippet below, but it fails saying the file doesn't exist even though it is there. Looking closely, a slash (/) is missing from the path in the error.
println(path)
val data = sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("location", s3path)
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "true")
  .option("addColorColumns", "true")
  .load(path)
The path is printed correctly as:
s3a://AKIAJDDDDDDACNA:A6voquDDDDDqNOUsONDy@my-test/test.xlsx
But why is the slash missing when Spark reads the path? Here is the error message:
Name: java.io.FileNotFoundException
Message: s3a:/AKIAJYDDDDDDNA:A6DDDDDDDDDwqxkRqUQyXqqNOUsONDy@my-test/test.xlsx (No such file or directory)
StackTrace: at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:212)
at java.io.FileInputStream.<init>(FileInputStream.java:152)
at java.io.FileInputStream.<init>(FileInputStream.java:104)
at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:28)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:31)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:7)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:345)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
at $anonfun$1.apply(<console>:46)
at $anonfun$1.apply(<console>:46)
at time(<console>:36)
Somehow the s3a URL is getting down to java.io.FileInputStream.open(), which only works with local filesystem files, not HDFS, S3, etc. That also explains the missing slash: FileInputStream wraps the string in a java.io.File, which treats it as a local path and collapses the double slash after s3a:. You will need to track down what is happening there inside com.crealytics.spark.excel. Welcome to the world of using IDEs to work out what third-party libraries get up to :) (IntelliJ IDEA is very good at that, by the way, as it can go from a pasted stack trace straight to the specific source code.)
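
In the meantime, a possible workaround is to copy the object down to the local filesystem yourself through Hadoop's FileSystem API (which does resolve s3a:// properly) and hand spark-excel a local path. A rough sketch, assuming a SparkContext named sc with working s3a credentials, a writable /tmp, and the bucket/key from your question:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Fetch the workbook via Hadoop's FileSystem, not java.io.FileInputStream,
// so the s3a:// URI is resolved by the S3A connector.
val src = new Path("s3a://my-test/test.xlsx")
val dst = new Path("file:///tmp/test.xlsx")
val fs = FileSystem.get(new URI("s3a://my-test"), sc.hadoopConfiguration)
fs.copyToLocalFile(src, dst)

// spark-excel can now open the local copy with a plain FileInputStream.
val data = sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "true")
  .option("addColorColumns", "true")
  .load("/tmp/test.xlsx")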
Also: don't put your secrets in your URLs; that's dangerous, and something that may get disabled in future for security reasons. Set spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key in your spark-defaults.conf instead.
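
For example, in spark-defaults.conf (placeholder values, obviously):

spark.hadoop.fs.s3a.access.key AKIAEXAMPLEKEY
spark.hadoop.fs.s3a.secret.key exampleSecretKeyMaterial

Or, equivalently, set the same keys (without the spark.hadoop prefix) on the Hadoop configuration at runtime:

// Placeholder credentials; sc is your SparkContext.
sc.hadoopConfiguration.set("fs.s3a.access.key", "AKIAEXAMPLEKEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "exampleSecretKeyMaterial")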