Inspect Parquet from command line


How do I inspect the content of a Parquet file from the command line?

The only option I see right now is:

$ hadoop fs -get my-path local-file
$ parquet-tools head local-file | less

I would like to

  1. avoid creating the local-file, and
  2. view the file content as JSON rather than the untyped text that parquet-tools prints.

Is there an easy way?


Solution

  • Use the parquet-tools cat command with the --json option. It reads the file directly from HDFS, so no local copy is needed, and it prints each record as JSON.

    Here is an example:

    parquet-tools cat --json hdfs://localhost/tmp/save/part-r-00000-6a3ccfae-5eb9-4a88-8ce8-b11b2644d5de.gz.parquet

    This prints out the data in JSON format:

    {"name":"gil","age":48,"city":"london"}
    {"name":"jane","age":30,"city":"new york"}
    {"name":"jordan","age":18,"city":"toronto"}
    

    Disclaimer: this was tested on Cloudera CDH 5.12.0.
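
Since cat prints every record, it can help to cap the output when you only want a quick look at a large file. Below is a minimal sketch of a wrapper for that. The function name pq_preview and the PARQUET_TOOLS override variable are hypothetical names made up for this example, not part of parquet-tools; the only parquet-tools feature used is the cat --json command shown above, and the limit comes from the standard head utility.

```shell
# Hypothetical helper (not part of parquet-tools): print the first N records
# of a Parquet file as JSON, one record per line.
# Assumes parquet-tools is on PATH; set PARQUET_TOOLS to point at another binary.
pq_preview() {
  local path
  local count
  path="$1"          # e.g. an hdfs:// URI or a local path
  count="${2:-10}"   # number of records to show (defaults to 10)
  "${PARQUET_TOOLS:-parquet-tools}" cat --json "$path" | head -n "$count"
}
```

For example, pq_preview my-path 3 would show the first three records of my-path, and you can still pipe the result into less as in the question.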