[SOLVED] Find the compression codec used for a Hadoop file

Given a compressed file, written on the Hadoop platform, in one of the following formats:

Avro
Parquet
SequenceFile

How can I find the compression codec used? Assuming that one of the following compression codecs is used (and there is no file extension in the file name):

Snappy
Gzip (not supported on Avro)
Deflate (not supported on Parquet)

Solution

The Java implementation of Parquet includes the parquet-tools utility, which provides several commands. See its documentation page for building and getting started. More detailed descriptions of the individual commands are printed by parquet-tools itself. The command you are looking for is meta. It shows all kinds of metadata, including the compression codec of each column chunk. You can find an example output here, showing SNAPPY compression.

Please note that the compression algorithm does not have to be the same across the whole file. Different column chunks can use different compression codecs, so there is no single field for the compression codec, but one for each column chunk instead. (A column chunk is the part of a column that belongs to one row group.) In practice, however, you will probably find the same compression codec being used for all column chunks.
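If you would rather check this from a script than from parquet-tools, the same per-column-chunk information can be read with the pyarrow library. This is only a sketch under assumptions: pyarrow is not part of the answer above and has to be installed separately, and the helper names codecs_by_column and parquet_codecs are made up for illustration. It collects the codec of every column chunk, so a file with mixed codecs shows more than one entry for a column.

```python
from collections import defaultdict


def codecs_by_column(metadata):
    """Map each column path to the set of codecs used by its column chunks.

    `metadata` is expected to look like pyarrow's Parquet file metadata:
    it exposes num_row_groups and row_group(i); each row group exposes
    num_columns and column(j); each column chunk exposes path_in_schema
    and compression.
    """
    codecs = defaultdict(set)
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            chunk = row_group.column(col_index)
            codecs[chunk.path_in_schema].add(chunk.compression)
    return dict(codecs)


def parquet_codecs(path):
    """Read only the footer of a Parquet file and summarize its codecs."""
    # Third-party dependency (assumption: pyarrow is installed).
    import pyarrow.parquet as pq

    return codecs_by_column(pq.ParquetFile(path).metadata)
```

For a hypothetical file named part-00000, parquet_codecs("part-00000") would return a dictionary such as one set of codec names per column; a set with more than one element means the column chunks of that column do not all use the same codec.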

A similar utility exists for Avro, called avro-tools. I'm not that familiar with it, but it has a getmeta command, which should show you the compression codec used (it is stored in the file metadata under the avro.codec key).
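For Avro, the codec can also be read directly from the file header without any tools. The sketch below is my own illustration, not what avro-tools does internally: it relies on the Avro object container file layout (the 4-byte magic Obj followed by 0x01, then a metadata map of string keys to byte values) and on the rule that a missing avro.codec entry means "null", i.e. no compression. The function names are made up for this example.

```python
AVRO_MAGIC = b"Obj\x01"


def _read_long(stream):
    """Decode one Avro long (zigzag-encoded, variable-length)."""
    shift = 0
    raw = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of Avro header")
        raw |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)


def _read_bytes(stream):
    """Read a length-prefixed byte string."""
    length = _read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of Avro header")
    return data


def read_avro_metadata(stream):
    """Return the header metadata of an Avro container file as a dict."""
    if stream.read(4) != AVRO_MAGIC:
        raise ValueError("not an Avro object container file")
    metadata = {}
    while True:
        count = _read_long(stream)
        if count == 0:  # a zero count terminates the map
            break
        if count < 0:  # negative count: a block size in bytes follows
            _read_long(stream)
            count = -count
        for _ in range(count):
            key = _read_bytes(stream).decode("utf-8")
            metadata[key] = _read_bytes(stream)
    return metadata


def avro_codec(stream):
    """Return the codec name, or "null" when the file is uncompressed."""
    return read_avro_metadata(stream).get("avro.codec", b"null").decode("utf-8")
```

Usage, with a hypothetical file name: open the file in binary mode, for example with open("part-00000", "rb") as f, and call avro_codec(f). Only the header is read, so this stays cheap even for large files.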