
What is the benefit of using nested data types in Parquet?


Is there any performance benefit to using nested data types in the Parquet file format?

AFAIK, Parquet files are usually created specifically for query services, e.g. Athena, so the process that creates them might as well flatten the values. That would allow easier querying and a simpler schema, and would retain the column statistics for each column.
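To make the flattening idea concrete, here is a minimal sketch in plain Python of what such a writer process might do before handing records to a Parquet library. The `flatten` helper, the `_` separator, and the sample record are all hypothetical and only illustrate the approach; they are not taken from any particular tool.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], sep: str = "_", prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into a single level, joining key paths with `sep`."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into struct-like values, carrying the path so far.
            flat.update(flatten(value, sep=sep, prefix=name))
        else:
            # Leaf values (including lists) are kept as-is.
            flat[name] = value
    return flat


# Hypothetical record with one struct-like field nested two levels deep.
nested = {"id": 1, "address": {"city": "Oslo", "geo": {"lat": 59.9, "lon": 10.7}}}
print(flatten(nested))
```

Each leaf ends up as its own top-level column name (for example `address_geo_lat`). Note that this sketch does not guard against name collisions, e.g. a record that already has a top-level `address_city` key.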

What benefit is there to be gained by using nested data types, e.g. struct?


Solution

  • There is a downside to keeping nested structures in Parquet: Spark's predicate pushdown doesn't work properly when the Parquet file contains nested structures.

    So even if you are working with only a few fields of your Parquet dataset, Spark will load and materialize the entire dataset.

    Here is the ticket for this issue, which has been open for a long time.

    EDIT

    The issue has been resolved in Spark 2.4.
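
Assuming the fix mentioned above is Spark's nested schema pruning for Parquet, it is worth knowing that in Spark 2.4 this optimization sits behind a configuration flag that is off by default, and that it is on by default from Spark 3.0. A minimal `spark-defaults.conf` fragment, which should be checked against the documentation for your Spark version:

```
# Assumption: the fix referred to above is nested schema pruning.
# Off by default in Spark 2.4; on by default from Spark 3.0.
spark.sql.optimizer.nestedSchemaPruning.enabled  true
```

With the flag enabled, selecting a single field of a struct should read only that field's column rather than the whole struct. One way to check is to look at the `ReadSchema` entry in the physical plan printed by `explain()`.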