Python: Obtain number of rows for ParquetDataset?


How do I obtain the number of rows of a ParquetDataset that is stored as a folder containing multiple Parquet files?

I tried

from pyarrow.parquet import ParquetDataset
a = ParquetDataset(path)
a.metadata
a.schema
a.common_metadata

I want to find the total number of rows without reading the dataset, as it can be quite large.

What's the best way to do that?


Solution

  • You still have to touch each individual file, but Parquet stores the total row count of each file in its footer. You therefore only need to read the metadata of each file, not its data, to get its row count. The following code computes the number of rows in the ParquetDataset:

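    from pyarrow.parquet import ParquetDataset

    nrows = 0
    dataset = ParquetDataset(path)  # path: the dataset folder from the question
    for piece in dataset.pieces:
        # Reads only this file's footer metadata, not its row data
        nrows += piece.get_metadata().num_rows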