TL;DR
Spark 1.6.1 fails to write a CSV file using Spark CSV 1.4 on a standalone cluster with no HDFS, throwing IOException: Mkdirs failed to create file.
More details:
I'm working on a Spark 1.6.1 application written in Scala, running on a standalone cluster that uses the local filesystem (the machine I'm running on doesn't even have HDFS installed). I have a DataFrame that I'm trying to save as a CSV file using HiveContext.
This is what I'm running:
exportData.write
  .mode(SaveMode.Overwrite)
  .format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .save("/some/path/here") // no hdfs:// or file:// prefix in the path
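For reference, the variant with an explicit local filesystem scheme would look like this (just an illustrative sketch using the same placeholder path; I haven't confirmed whether it changes anything):

// Illustrative only: the same write with an explicit file:// scheme, so the
// path is resolved against the local filesystem rather than a default FS.
exportData.write
  .mode(SaveMode.Overwrite)
  .format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .save("file:///some/path/here") // placeholder path, same as above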
The Spark CSV that I'm using is 1.4. When running this code I get the following exception:
WARN TaskSetManager:70 - Lost task 4.3 in stage 10.0: java.io.IOException: Mkdirs failed to create file: /some/path/here/_temporary/0
The full stacktrace is:
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The output directory does get created, but it's empty.
I also tried running it from the spark-shell: I created a dummy DataFrame and saved it using the exact same write code, to the same path (sketched below), and it succeeded.
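The shell test was roughly along these lines (a sketch; the dummy data and column names are placeholders, and spark-csv is on the classpath, e.g. via --packages com.databricks:spark-csv_2.10:1.4.0):

import org.apache.spark.sql.SaveMode
import sqlContext.implicits._ // sqlContext is provided by the spark-shell

// Placeholder dummy DataFrame, written with the same code as the application
val dummy = Seq((1, "a"), (2, "b")).toDF("id", "value")
dummy.write
  .mode(SaveMode.Overwrite)
  .format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .save("/some/path/here")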
I checked the permissions on the folder I'm writing to and changed them to 777, but it still doesn't work when running the Spark job.
Googling didn't lead me to a working solution. Does anyone have any idea what exactly the problem is, and how to overcome it?
Thanks in advance
OK, so I found the problem, and I hope this will help others.
Apparently the machine I'm running on has Hadoop installed on it. When I ran hadoop version, it output Hadoop 2.6.0-cdh5.7.1, which conflicts with my Spark version.
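A quick way to compare that against what Spark itself is running with, from the shell (just a diagnostic sketch):

// Compare the Hadoop client version on Spark's classpath with `hadoop version`
println(sc.version) // Spark version (sc is provided by the spark-shell)
println(org.apache.hadoop.util.VersionInfo.getVersion) // Hadoop version Spark is linked against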
Also, I'm not quite sure whether it's related, but I was running Spark as root instead of as the Spark user, which may have caused some permission issues.
After matching the Hadoop version to our Spark (in our case, by switching to Cloudera's Spark distribution) and running the code as the Spark user, this failure stopped happening.
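If your application is built against stock Spark, aligning the build with the CDH cluster is the same idea; here's a rough build.sbt sketch (the CDH version strings are only an example, check what your cluster actually runs):

// Example only: pulling Spark artifacts that match a CDH 5.7.1 cluster.
// Verify the exact artifact versions against your own cluster before using this.
resolvers += "cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.0-cdh5.7.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.6.0-cdh5.7.1" % "provided",
  "com.databricks"   %% "spark-csv"  % "1.4.0"
)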