parquet, pyarrow, apache-arrow

What is the difference between data_page_version=1.0 and 2.0 in parquet files?


In pyarrow, the parquet writer has the data_page_version parameter which is either "1.0" or "2.0" with the default of "1.0". I sometimes save files with "2.0" because 'hey higher version must be better, right?'. Other times I don't bother setting that option so I get the default. I've never noticed a difference or had a problem either way in using polars, pyarrow, duckdb (occasionally), or Azure Synapse.

The Apache Parquet site doesn't say anything about the data page version.

The pyarrow write_table doc just says it doesn't impact types, etc., but not how it does matter.

What is/are the importance/features/pitfalls of the data page version?


Solution

  • As background, a Parquet file organizes rows into chunks called row groups. Within each row group, data is stored in a columnar manner (each column's data within a row group is referred to as a column chunk). Each column chunk is divided into data pages.

    Data page V1 and data page V2 are two iterations of the page format. The differences revolve around what metadata is stored, how it is stored, and a few other things, such as whether a row (for repeated columns) can span page boundaries. Since the introduction of column indexes, keeping rows terminated at page boundaries needs to be supported for V1 as well (these indexes can be written by the underlying C++ library as of Arrow V13 or Arrow V14; I'm not clear whether this feature is available in pyarrow yet).

    Separately, although the specification does not say so, new data encoding types have been developed over time and were originally intended to be part of an overall V2 Parquet effort. So encodings and metadata have been somewhat intermingled, which can affect how implementations choose an encoding.

    There is a stalled effort to clarify these topics; maybe it will make some progress in 2024.

    TL;DR: This is all a little inscrutable from a pure consumer's perspective. For compatibility, and to stay on a well-trodden path, I'd recommend using V1 data pages until the community can agree on a path forward.