[SOLVED] How do I get schema / column names from parquet file?

I have a file stored in HDFS as part-m-00000.gz.parquet

I tried running hdfs dfs -text dir/part-m-00000.gz.parquet, but the output is compressed. I then ran gunzip part-m-00000.gz.parquet, but it doesn't uncompress the file because it doesn't recognise the .parquet extension.

How do I get the schema / column names for this file?

Solution

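You won't be able to "open" the file using hdfs dfs -text because it's not a text file. Parquet is a binary, columnar format, so it is laid out on disk very differently from a text file. The .gz in the file name most likely means that the data inside the file was gzip-compressed by the Parquet writer, not that the whole file is a gzip archive, which is why gunzip doesn't help either.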
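To make the difference from a text file concrete, here is a minimal Python sketch (standard library only) that checks the on-disk framing of a Parquet file. It assumes a local, unencrypted file: such a file starts and ends with the 4-byte magic PAR1, and the 4 bytes just before the trailing magic hold the length of the footer metadata as a little-endian 32-bit integer. The function name parquet_footer_length is my own and is not part of any library.

```python
import struct

PARQUET_MAGIC = b"PAR1"  # marker at both ends of an unencrypted Parquet file


def parquet_footer_length(path):
    """Return the size in bytes of the footer metadata of a Parquet file.

    Raises ValueError if the file is too short or lacks the PAR1 magic,
    i.e. if it does not look like an (unencrypted) Parquet file.
    """
    with open(path, "rb") as f:
        head = f.read(4)          # leading magic
        f.seek(0, 2)              # move to the end to learn the file size
        size = f.tell()
        if size < 12:             # 4 (magic) + 4 (length) + 4 (magic)
            raise ValueError("file is too short to be a Parquet file")
        f.seek(-8, 2)             # last 8 bytes: footer length + trailing magic
        tail = f.read(8)

    if head != PARQUET_MAGIC or tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic, not a Parquet file")

    # The footer length is a little-endian signed 32-bit integer.
    (footer_len,) = struct.unpack("<i", tail[:4])
    return footer_len
```

The footer that this length points to is where the schema is stored, so a tool only has to read the end of the file to print the column names. The file would need to be copied out of HDFS first (for example with hdfs dfs -get) before a local script like this can read it.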

There are command-line utilities for exactly this kind of task. They let you open a Parquet file and view its schema, data, metadata, etc.

Check out ktrueda's parquet-tools project.

Also, Cloudera (which supports and contributes heavily to Parquet) has a nice page with examples of using hangxie's parquet-tools. An example from that page for your use case:

parquet-tools schema part-m-00000.parquet

Check out the Cloudera page: Using Apache Parquet Data Files with CDH - Parquet File Structure.