scala apache-spark parquet spark2.4.4

Extension of compressed parquet file in Spark


In my Spark job, I write a compressed parquet file like this:

df
  .repartition(numberOutputFiles)
  .write
  .option("compression","gzip")
  .mode(saveMode)
  .parquet(avroPath)

Then my files have this extension: file_name.gz.parquet

How can I get ".parquet.gz" instead?


Solution

  • I don't believe you can. The file extension is hardcoded in ParquetWrite.scala as the concatenation of the codec's extension and ".parquet", in that order:

      :
        override def getFileExtension(context: TaskAttemptContext): String = {
          CodecConfig.from(context).getCodec.getExtension + ".parquet"
        }
      :
    

    So, unless you want to change the source and compile your own Spark version, or open a JIRA request against Spark... ;))
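A workaround that leaves Spark untouched is to rename the part files once the write has finished. The following is only a minimal sketch, not anything Spark provides: `RenameParquetParts`, `swapExtension` and `renameAll` are hypothetical names, and it assumes the output directory is on the local filesystem so that plain `java.nio.file` calls apply. On HDFS or an object store the same idea would need Hadoop's `FileSystem` API instead. Keep in mind that Parquet compresses data inside the file, so a ".parquet.gz" file is still not a gzip stream, and tools that trust the extension may be confused by it.

```scala
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

object RenameParquetParts {
  // Hypothetical helper: "part-0000.gz.parquet" -> "part-0000.parquet.gz".
  // Names that do not end in ".gz.parquet" are returned unchanged.
  def swapExtension(name: String): String =
    if (name.endsWith(".gz.parquet")) name.stripSuffix(".gz.parquet") + ".parquet.gz"
    else name

  // Renames every "*.gz.parquet" file directly under `dir`
  // (assumed to be a local directory) and returns the new paths.
  def renameAll(dir: Path): Seq[Path] = {
    val stream = Files.list(dir)
    try {
      stream.iterator().asScala.toList
        .filter(p => p.getFileName.toString.endsWith(".gz.parquet"))
        .map(p => Files.move(p, p.resolveSibling(swapExtension(p.getFileName.toString))))
    } finally stream.close()
  }
}
```

After the write from the question, this could be called as `RenameParquetParts.renameAll(java.nio.file.Paths.get(avroPath))`, assuming `avroPath` is a local path. Files such as `_SUCCESS` do not match the suffix and are left alone.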