apache-spark, apache-spark-sql, spark-hive

Spark CSV IOException Mkdirs failed to create file


TL;DR

Spark 1.6.1 fails to write a CSV file using Spark CSV 1.4 on a standalone cluster with no HDFS, throwing "IOException: Mkdirs failed to create file".

More details:

I'm working on a Spark 1.6.1 application in Scala, running on a standalone cluster that uses the local filesystem (the machine I'm running on doesn't even have HDFS on it). I have a DataFrame that I'm trying to save as a CSV file using HiveContext.

This is what I'm running:

exportData.write
      .mode(SaveMode.Overwrite)
      .format("com.databricks.spark.csv")
      .option("delimiter", ",")
      .save("/some/path/here") // no hdfs:/ or file:/ prefix in the path

The Spark CSV version I'm using is 1.4. When running this code I get the following exception:

WARN  TaskSetManager:70 - Lost task 4.3 in stage 10.0: java.io.IOException: Mkdirs failed to create file: /some/path/here/_temporary/0

The full stacktrace is:

at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
        at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
        at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

The output dir does get created, but it's empty.

I also tried running it from the spark-shell: I created a dummy DataFrame and saved it using the exact same write code (and to the same path). That succeeded.
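For illustration, the spark-shell test looked roughly like this (the dummy data, column names, and the --packages coordinate are assumptions for the sketch, not my exact code):

// Launched spark-shell with the spark-csv package, e.g. (assuming Scala 2.10):
//   spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
import org.apache.spark.sql.SaveMode

// dummy DataFrame with made-up columns, just to exercise the same write path
val dummy = sqlContext.createDataFrame(Seq(
  (1, "a"),
  (2, "b")
)).toDF("id", "value")

dummy.write
  .mode(SaveMode.Overwrite)
  .format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .save("/some/path/here") // same local path, still no scheme prefix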

I checked the permissions on the folder I'm writing to and changed them to 777, but it still doesn't work when running the Spark job.

Googling it suggested:

Does anyone have any idea what exactly the problem is, and how to overcome it?

Thanks in advance


Solution

  • OK, so I found the problem, and I hope this will help others.

    Apparently the machine I'm running on has Hadoop installed on it. When I ran hadoop version it printed Hadoop 2.6.0-cdh5.7.1, which conflicts with my Spark version (a quick way to check for this mismatch is sketched below).

    Also, I'm not quite sure whether it's related, but I was running Spark as root instead of as the Spark user, which may have caused some permission issues.

    After matching the Hadoop version to our Spark (in our case, we switched to Cloudera's Spark build) and running the code as the Spark user, this failure stopped happening.
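    For anyone checking for the same mismatch, here's a rough sketch of how to compare the Hadoop version bundled with Spark against the one installed on the machine (the exact version strings will vary by installation):

    // In spark-shell: the Hadoop version the running Spark was built against
    println(org.apache.hadoop.util.VersionInfo.getVersion)

    // From a terminal: the Hadoop installed on the machine
    //   hadoop version   // e.g. prints "Hadoop 2.6.0-cdh5.7.1"
    //
    // If the two differ (e.g. a vanilla Spark build next to a CDH Hadoop),
    // align them by using the Spark build that matches the installed Hadoop distribution.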