In my Spark job, I write a compressed parquet file like this:
df
  .repartition(numberOutputFiles)
  .write
  .option("compression", "gzip")
  .mode(saveMode)
  .parquet(avroPath)
Then my files have this extension: file_name.gz.parquet
How can I get ".parquet.gz" instead?
I don't believe you can. The file extension is hardcoded in ParquetWrite.scala
as the concatenation of the codec's extension and ".parquet", in that order:
:
override def getFileExtension(context: TaskAttemptContext): String = {
  CodecConfig.from(context).getCodec.getExtension + ".parquet"
}
:
So you're stuck with ".gz.parquet" unless you want to change the source and compile your own Spark version, or open a JIRA request against Spark ;)
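If you only need the names to end in ".parquet.gz", one possible workaround is to rename the part files after the write has finished. Below is a rough sketch, not anything Spark provides: `SwapParquetExtension`, `swapExtension` and `renameAll` are hypothetical names, and it assumes the output directory is on the local filesystem (on HDFS or S3 you would go through Hadoop's `FileSystem.rename` instead). Keep in mind that the files are still ordinary Parquet files with gzip-compressed pages inside, not gzip streams, so tools that trust the ".gz" suffix (gunzip, for example) will not be able to open them.

```scala
import java.io.File
import java.nio.file.Files

// Hypothetical helper, not part of Spark.
object SwapParquetExtension {
  private val OldSuffix = ".gz.parquet"
  private val NewSuffix = ".parquet.gz"

  // "part-00000.gz.parquet" becomes "part-00000.parquet.gz".
  // Names that do not end in ".gz.parquet" are returned unchanged.
  def swapExtension(name: String): String =
    if (name.endsWith(OldSuffix)) name.stripSuffix(OldSuffix) + NewSuffix
    else name

  // Renames every "*.gz.parquet" file directly inside `dir`
  // (assumption: a local directory) and returns the new file names.
  def renameAll(dir: File): Seq[String] = {
    // listFiles() returns null if `dir` is not a readable directory.
    val files = Option(dir.listFiles()).getOrElse(Array.empty[File])
    files.toSeq
      .filter(f => f.isFile && f.getName.endsWith(OldSuffix))
      .map { f =>
        val target = new File(dir, swapExtension(f.getName))
        Files.move(f.toPath, target.toPath)
        target.getName
      }
  }
}
```

With the job from the question, that would be something like `SwapParquetExtension.renameAll(new File(avroPath))` right after the `.parquet(avroPath)` call, again assuming `avroPath` points to a local directory.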