Can't get Spark to use the magic output committer for s3 with EMR


I'm trying to use the magic output committer, but whatever I do I get the default output committer.

INFO FileOutputCommitter: File Output Committer Algorithm version is 10
22/03/08 01:13:06 ERROR Application: Only 1 or 2 algorithm version is supported

According to the Hadoop docs, this log output is how I know I'm still getting the default committer. What am I doing wrong? This is my relevant conf (set via SparkConf()); I have tried many other variations.

  .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "10")
  .set("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  .set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a", "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
  .set("fs.s3a.committer.name", "magic")
  .set("spark.sql.sources.commitProtocolClass", "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .set("spark.sql.parquet.output.committer.class", "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")

I do not have any other configuration relevant to this, neither in code nor in conf files (Hadoop or Spark). Maybe I should? The paths I'm writing to start with s3://. I'm using Hadoop 3.2.1, Spark 3.0.0 and EMR 6.1.1.


Solution

  • After a lot of reading, plus stevel's comment, I found what I needed. I'm now using the optimized output committer that is built into EMR and used by default. The reason I didn't get it at first is that the AWS optimized committer is activated only when certain conditions are met. Until EMR 6.4.0 it worked only under some conditions, but from 6.4.0 it works for every write type (text, CSV, Parquet) and with RDDs, DataFrames and Datasets. So I just needed to upgrade to EMR 6.4.0.

    There was an improvement of 50-60 percent in execution time.

    The optimized committer requirements.
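
For anyone arriving with a configuration like the one in the question, here is a rough sanity check in plain Python. It is only a sketch: `lint_committer_conf`, `question_conf`, and the bucket name are made up for illustration and are not part of Spark, Hadoop, or EMR. It flags three things visible in the question: an algorithm version other than 1 or 2, a Hadoop key set without the `spark.hadoop.` prefix, and S3A committer settings combined with an `s3://` output path (which, as far as I understand, goes through EMRFS on EMR rather than the S3A connector).

```python
# Sketch: lint a dict of Spark settings for issues relevant to S3 committers.
# The function name and the checks are illustrative, not a Spark or EMR API.

def lint_committer_conf(conf, output_path):
    """Return a list of human-readable warnings for the given settings."""
    warnings = []

    # FileOutputCommitter only accepts algorithm version 1 or 2
    # (see the "Only 1 or 2 algorithm version is supported" error above).
    version = conf.get("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version")
    if version is not None and version not in ("1", "2"):
        warnings.append(f"algorithm.version={version} is invalid; use 1 or 2")

    # Hadoop options set through SparkConf need the "spark.hadoop." prefix,
    # otherwise they are not copied into the Hadoop configuration.
    for key in conf:
        if key.startswith(("fs.", "mapreduce.")):
            warnings.append(f"'{key}' lacks the 'spark.hadoop.' prefix")

    # The S3A committers are bound to the s3a:// scheme, so S3A settings
    # have no effect on a path that uses a different scheme.
    if any("s3a" in key for key in conf) and not output_path.startswith("s3a://"):
        warnings.append(f"S3A committer settings do not apply to '{output_path}'")

    return warnings


# The settings from the question, copied as-is.
question_conf = {
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "10",
    "spark.hadoop.fs.s3a.committer.magic.enabled": "true",
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    "fs.s3a.committer.name": "magic",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}

# "my-bucket" is a placeholder; only the s3:// scheme matters here.
for warning in lint_committer_conf(question_conf, "s3://my-bucket/output"):
    print(warning)
```

Run against the question's settings, this reports all three issues. None of them matter if you stay on `s3://` paths and rely on the EMR built-in optimized committer, as described above.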