Tags: apache-spark, apache-spark-sql, parquet, snappy

Parquet compression degradation when upgrading Spark


I have a Spark job that writes data to Parquet files with Snappy compression. One of the columns in the Parquet schema is a repeated INT64 (an array of longs).

When upgrading from Spark 2.2 (Parquet 1.8.2) to Spark 3.1.1 (Parquet 1.10.1), I saw a severe degradation in the compression ratio.
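
For context, the write path is just a DataFrame with a single nullable array-of-longs column written out as Snappy-compressed Parquet. A minimal sketch of that shape (the data, column name and output path here are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("numbers-writer").getOrCreate()
import spark.implicits._

// Single nullable array<long> column, matching the "numbers" schema in the footers below.
val df = Seq(
  Tuple1(Seq(4L, 42L, 1967324L)),
  Tuple1(Seq(7L, 7L, 7L))
).toDF("numbers")

// Snappy is Spark's default Parquet codec; spelled out here for clarity.
df.write
  .option("compression", "snappy")
  .parquet("/tmp/numbers-parquet")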

For example, for this file (saved with Spark 2.2) I have the following metadata:

creator:     parquet-mr version 1.8.2 (build c6522788629e590a53eb79874b95f6c3ff11f16c) 
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"numbers","type":{"type":"array","elementType":"long","containsNull":true},"nullable":true,"metadata":{}}]} 

file schema: spark_schema 
--------------------------------------------------------------------------------
numbers:     OPTIONAL F:1 
.list:       REPEATED F:1 
..element:   OPTIONAL INT64 R:1 D:3

row group 1: RC:186226 TS:163626010 OFFSET:4 
--------------------------------------------------------------------------------
numbers:     
.list:       
..element:    INT64 SNAPPY DO:0 FPO:4 SZ:79747617/163626010/2.05 VC:87158527 ENC:RLE,PLAIN_DICTIONARY ST:[min: 4, max: 1967324, num_nulls: 39883]

Reading it with Spark 3.1 and saving it again as Parquet, I get the following metadata, and the Parquet part file size increases from 76 MB to 124 MB:

creator:     parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) 
extra:       org.apache.spark.version = 3.1.1 
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"numbers","type":{"type":"array","elementType":"long","containsNull":true},"nullable":true,"metadata":{}}]} 

file schema: spark_schema 
--------------------------------------------------------------------------------
numbers:     OPTIONAL F:1 
.list:       REPEATED F:1 
..element:   OPTIONAL INT64 R:1 D:3

row group 1: RC:186226 TS:163655597 OFFSET:4 
--------------------------------------------------------------------------------
numbers:     
.list:       
..element:    INT64 SNAPPY DO:0 FPO:4 SZ:129657160/163655597/1.26 VC:87158527 ENC:RLE,PLAIN_DICTIONARY ST:[min: 4, max: 1967324, num_nulls: 39883]

Note that the compression ratio decreased from 2.05 to 1.26.
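
For reference, the round trip that produces the second footer is nothing more than reading the old output and writing it back out; roughly (paths are hypothetical, spark is the active SparkSession):

// Read the part files written by Spark 2.2 / Parquet 1.8.2 ...
val oldDf = spark.read.parquet("/data/spark22-output")

// ... and rewrite them with Spark 3.1.1 / Parquet 1.10.1, still Snappy-compressed.
oldDf.write
  .option("compression", "snappy")
  .parquet("/data/spark31-output")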

I tried looking for any configuration that changed between the Spark or Parquet versions. The only thing I could find is that the default for parquet.writer.max-padding changed from 0 to 8 MB, but even when changing this configuration back to 0, I get the same results.
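
Changing the padding back to 0 can be done through the job's Hadoop configuration, which ParquetOutputFormat reads; a sketch (spark is the active SparkSession, the output path is hypothetical):

// Restore the old default of no row-group padding.
spark.sparkContext.hadoopConfiguration.setInt("parquet.writer.max-padding", 0)

// Rewrite as before; per the test above, the resulting sizes were unchanged.
oldDf.write
  .option("compression", "snappy")
  .parquet("/data/spark31-output-no-padding")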

Below is the ParquetOutputFormat configuration I have with both setups:

Parquet block size to 134217728
Parquet page size to 1048576
Parquet dictionary page size to 1048576
Dictionary is on
Validation is off
Writer version is: PARQUET_1_0
Maximum row group padding size is 0 bytes
Page size checking is: estimated
Min row count for page size check is: 100
Max row count for page size check is: 10000
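
To rule out configuration drift between the two setups, the same writer settings can also be pinned explicitly on the Hadoop configuration instead of relying on defaults. A sketch, using the ParquetOutputFormat property names as I understand them (worth double-checking against your Parquet version):

val hadoopConf = spark.sparkContext.hadoopConfiguration

// Pin the writer settings listed above so both Spark versions run with identical values.
hadoopConf.setInt("parquet.block.size", 134217728)
hadoopConf.setInt("parquet.page.size", 1048576)
hadoopConf.setInt("parquet.dictionary.page.size", 1048576)
hadoopConf.setBoolean("parquet.enable.dictionary", true)
hadoopConf.set("parquet.writer.version", "PARQUET_1_0")
hadoopConf.setInt("parquet.writer.max-padding", 0)   // as tested above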

I would appreciate any guidance here.

Thanks!

UPDATE

I checked Spark 3 with snappy-java 1.1.2.6 (the version used by Spark 2.2), and the compression ratio looks good. I will look further into this issue and update with my findings.


Solution

  • So, as noted in the update above, snappy-java 1.1.2.6 resolved my issue; any version higher than this results in degraded compression (a sketch of pinning the dependency follows below). I also tried the purejava flag, but that results in an exception when reading Parquet. I will open tickets for snappy-java.
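
In case it helps anyone, pinning the dependency in an sbt build looks roughly like this (coordinates are org.xerial.snappy:snappy-java; adapt for Maven/Gradle as needed):

// build.sbt: force the older snappy-java that restores the previous compression ratio
dependencyOverrides += "org.xerial.snappy" % "snappy-java" % "1.1.2.6"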