Let's say my data is stored in object storage, say S3, with a datetime partition layout like this:
s3://my-bucket/year=2021/month=01/day=03/SOME-HASH-VAL1.parquet
...
s3://my-bucket/year=2022/month=12/day=31/SOME-HASH-VAL1000.parquet
According to the pandas read_parquet API docs, I can use the filters
argument to retrieve just a subset of the data, like this:
import pandas as pd

pd.read_parquet(
    "s3://my-bucket/",
    filters=[("year", ">=", 2021)],
)
But a problem occurs when I want to retrieve data after a specific date, say 2021-08-31:
pd.read_parquet(
    "s3://my-bucket/",
    filters=[("year", ">=", 2021), ("month", ">", 8)],
)
Nothing looks wrong here, but it also filters out the data from 2022-01 through 2022-08, because the two conditions are ANDed: month > 8 excludes the first eight months of every year, including 2022.
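For illustration, the ANDed filter behaves like the boolean mask below (a toy DataFrame standing in for the partition columns, not the actual S3 data):

import pandas as pd

# Toy stand-in for the partition columns, just to show the filter semantics.
df = pd.DataFrame({"year": [2021, 2022, 2022], "month": [9, 3, 10]})
# The two filter tuples are ANDed, so together they are equivalent to:
mask = (df["year"] >= 2021) & (df["month"] > 8)
print(df[mask])  # the 2022-03 row is dropped even though it is after 2021-08-31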
A possible solution could be storing the data with a single date partition like this:
s3://my-bucket/dt=2021-01-03/SOME-HASH-VAL.parquet
...
s3://my-bucket/dt=2022-12-31/SOME-HASH-VAL.parquet
But this could become a problem as time goes on, say 10 years later, because the bucket will have 365 x 10 = 3650 folders, which might cause performance issues when reading the data.
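For reference, that layout could be produced with something like the following (a rough sketch; the DataFrame and the "dt" column name are just for illustration):

import pandas as pd

# Hypothetical data with a pre-formatted "dt" column.
df = pd.DataFrame({"dt": ["2021-01-03", "2022-12-31"], "value": [1, 2]})
# partition_cols writes one dt=YYYY-MM-DD folder per distinct date.
df.to_parquet("s3://my-bucket/", partition_cols=["dt"])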
How can I solve these problems wisely?
You can actually filter the data correctly with your existing partition structure:
filters=[
    [("year", ">", 2021)],
    [("year", "=", 2021), ("month", ">", 8)],
    [("year", "=", 2021), ("month", "=", 8), ("day", ">=", 31)],
]
According to the doc:
Predicates are expressed in disjunctive normal form (DNF), like [[('x', '=', 0), ...], ...]. DNF allows arbitrary boolean logical combinations of single column predicates. The innermost tuples each describe a single column predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective and multiple column predicate. Finally, the most outer list combines these filters as a disjunction (OR).
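Putting it together, the full call would look something like this (a sketch assuming the same bucket layout as in your question; each inner list is ANDed and the outer lists are ORed, so it selects everything on or after 2021-08-31):

import pandas as pd

# Inner lists are conjunctions (AND), the outer list is a disjunction (OR).
df = pd.read_parquet(
    "s3://my-bucket/",
    filters=[
        [("year", ">", 2021)],
        [("year", "=", 2021), ("month", ">", 8)],
        [("year", "=", 2021), ("month", "=", 8), ("day", ">=", 31)],
    ],
)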