apache-spark gzip parquet snappy lzo

Spark SQL - difference between gzip vs snappy vs lzo compression formats


I am trying to use Spark SQL to write a Parquet file.

By default, Spark SQL uses gzip, but it also supports other compression formats such as snappy and lzo.

What is the difference between these compression formats?

Update: Recent versions of Spark use snappy as the default compression format.


Solution

  • Just try them on your data.

    lzo and snappy compress quickly and decompress very quickly, but they achieve lower compression ratios than gzip, which produces smaller files at the cost of being somewhat slower.

    Update many years later:

    Also try lz4 and zstd.
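One way to "just try them" is to write the same DataFrame once per codec and compare the write time and the size on disk. The sketch below is only an illustration, not a benchmark harness: the helper names `dir_size_bytes` and `compare_parquet_codecs` are made up for this example, `df` is assumed to be a PySpark DataFrame, and `base_path` is assumed to be on the local filesystem (the size check uses `os.walk`, so it will not work against HDFS or S3). The `compression` write option is the standard Parquet writer option in Spark; whether a given codec such as lzo is actually available depends on your Spark build and installed libraries.

```python
import os
import time


def dir_size_bytes(path):
    """Sum the sizes of all files under path.

    Spark writes a directory of part files, so the whole tree is measured.
    Small extras such as _SUCCESS and .crc files are included in the total.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def compare_parquet_codecs(df, base_path, codecs=("snappy", "gzip", "zstd")):
    """Write df once per codec; return {codec: (write_seconds, bytes_on_disk)}.

    Assumptions: df behaves like a pyspark.sql.DataFrame, and base_path is a
    local directory that this process can read back with os.walk.
    """
    results = {}
    for codec in codecs:
        out = os.path.join(base_path, codec)
        start = time.perf_counter()
        # The "compression" option selects the Parquet codec for this write.
        df.write.mode("overwrite").option("compression", codec).parquet(out)
        elapsed = time.perf_counter() - start
        results[codec] = (elapsed, dir_size_bytes(out))
    return results
```

With a real session you would call something like `compare_parquet_codecs(df, "/tmp/codec_test")` and look at the resulting dictionary. To time reads as well, load each output directory with `spark.read.parquet(...)` and force evaluation (for example with `.count()`), since decompression speed is where snappy, lzo and lz4 usually stand out.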