pandas · parquet · data-partitioning · hive-partitions

How to read filtered partitioned parquet files efficiently using pandas's read_parquet?


Let's say my data is stored in object storage, say S3, with a date-time partition like this:

s3://my-bucket/year=2021/month=01/day=03/SOME-HASH-VAL1.parquet
...
s3://my-bucket/year=2022/month=12/day=31/SOME-HASH-VAL1000.parquet

According to the pandas read_parquet API docs, I can use the filters argument to retrieve just a subset of the data, like this:

pd.read_parquet(
    "s3://my-bucket/",
    filters=[("year", ">=",  2021)],
)

But a problem occurs when I want to retrieve data from a specific date onward, say 2021-08-31:

pd.read_parquet(
    "s3://my-bucket/",
    filters=[("year", ">=",  2021), ("month", ">", 8)],
)

Nothing looks wrong here, but it also filters out the range 2022-01 through 2022-07: the month > 8 predicate is applied to every year independently, not just to 2021.

A possible solution could be storing the data with a single date partition like this:

s3://my-bucket/dt=2021-01-03/SOME-HASH-VAL.parquet
...
s3://my-bucket/dt=2022-12-31/SOME-HASH-VAL.parquet

But this could become a problem as time goes on; say, 10 years later the bucket will have 365 × 10 = 3,650 folders, which might cause a performance issue when reading the data.

How can I solve this problem wisely?


Solution

  • You can actually filter the data correctly with your existing partition structure by writing the condition in disjunctive normal form:

    filters=[
        [("year", ">", 2021)],
        [("year", "=", 2021), ("month", ">", 8)],
        [("year", "=", 2021), ("month", "=", 8), ("day", ">=", 31)],
    ]
    

    According to the doc:

    Predicates are expressed in disjunctive normal form (DNF), like [[('x', '=', 0), ...], ...]. DNF allows arbitrary boolean logical combinations of single column predicates. The innermost tuples each describe a single column predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective and multiple column predicate. Finally, the most outer list combines these filters as a disjunction (OR).