[SOLVED] How to find the COMPRESSION_CODEC used on a Parquet file at the time of its generation?

In Impala, we usually set the COMPRESSION_CODEC query option before inserting data into a table whose underlying files are in Parquet format.

Commands used to set COMPRESSION_CODEC:

set compression_codec=snappy;
set compression_codec=gzip;

Is there any operation on an existing Parquet file that shows which compression codec was used to write it?

Solution

One way to find the compression algorithm used for an Impala Parquet table is parquet-tools. This utility ships with Cloudera CDH, for example; otherwise it can be built from source without much effort.

$ parquet-tools meta <parquet-file>
creator:     impala version 2.13.0-SNAPSHOT (build 100d7da677f2c81efa6af2a5e3a2240199ae54d5)

file schema: schema
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
code:        OPTIONAL BINARY R:0 D:1
description: OPTIONAL BINARY R:0 D:1
value:       OPTIONAL INT32 O:INT_32 R:0 D:1

row group 1: RC:823 TS:20420
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
code:         BINARY GZIP DO:4 FPO:1727 SZ:2806/10130/3.61 VC:823 ENC:RLE,PLAIN_DICTIONARY
description:  BINARY GZIP DO:2884 FPO:12616 SZ:10815/32928/3.04 VC:823 ENC:RLE,PLAIN_DICTIONARY
value:        INT32 GZIP DO:17462 FPO:19614 SZ:3241/4130/1.27 VC:823 ENC:RLE,PLAIN_DICTIONARY

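In Parquet generally (though not through Impala), compression can be set column by column, so the output reports the codec separately for each column within every row group. It is the second field on each column line, right after the physical type: GZIP for all three columns in the example above.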
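
If you need the codecs programmatically rather than by eye, the text output can be parsed. Below is a minimal sketch in Python (standard library only). It assumes the output has the same layout as the listing above; other parquet-tools versions may print the metadata differently. The function name codecs_by_row_group and the SAMPLE string are made up for illustration.

```python
import re

# Codec names from the Parquet format's CompressionCodec enum.
CODECS = {"UNCOMPRESSED", "SNAPPY", "GZIP", "LZO", "BROTLI", "LZ4", "ZSTD", "LZ4_RAW"}

# "row group 1: RC:823 TS:20420"
ROW_GROUP = re.compile(r"^row group (\d+):")
# "<column>: <TYPE> <CODEC> DO:<offset> ..." (layout assumed from the listing above)
COLUMN = re.compile(r"^(?P<name>[^:]+):\s+(?P<type>\S+)\s+(?P<codec>\S+)\s+DO:\d+")


def codecs_by_row_group(meta_text):
    """Return {row_group_number: {column_name: codec}} parsed from
    the text printed by `parquet-tools meta`."""
    result = {}
    current = None  # row group currently being read; None while in the schema part
    for line in meta_text.splitlines():
        match = ROW_GROUP.match(line)
        if match:
            current = int(match.group(1))
            result[current] = {}
            continue
        match = COLUMN.match(line)
        # Schema lines such as "code: OPTIONAL BINARY R:0 D:1" have no "DO:" field,
        # so they never match the column pattern.
        if match and current is not None and match.group("codec") in CODECS:
            result[current][match.group("name").strip()] = match.group("codec")
    return result


# Hypothetical sample: the row-group part of the listing above.
SAMPLE = """\
row group 1: RC:823 TS:20420
--------------------------------
code:         BINARY GZIP DO:4 FPO:1727 SZ:2806/10130/3.61 VC:823 ENC:RLE,PLAIN_DICTIONARY
description:  BINARY GZIP DO:2884 FPO:12616 SZ:10815/32928/3.04 VC:823 ENC:RLE,PLAIN_DICTIONARY
value:        INT32 GZIP DO:17462 FPO:19614 SZ:3241/4130/1.27 VC:823 ENC:RLE,PLAIN_DICTIONARY
"""

print(codecs_by_row_group(SAMPLE))
# prints {1: {'code': 'GZIP', 'description': 'GZIP', 'value': 'GZIP'}}
```

In practice you would capture the real output of parquet-tools meta (for example with subprocess.run) and pass that text to the function instead of SAMPLE. Collecting the codec values into a set tells you quickly whether the whole file was written with a single codec.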