I have parquet files arranged in this format
/db/{year}/table{date}.parquet
In each year folder, there are up to 365 files.
If I want to query data from a time range, say the week 2024-04-28 to 2024-05-04, I could use
SELECT
count(*) as count,
FROM read_parquet('/db/2024/table*.parquet')
WHERE date >= '2024-04-28' and date < '2024-05-05'
But I don't need to read all files in /db/2024/table*.parquet. I know exactly which seven files has the data I need. How do I define this in duckdb? I am using Python. I could do my own filtering and put all the files in a Python list like filenames
.
you can pass python list into duckdb query:
import json
files = json.dumps(["df1.parquet", "df3.parquet"])
duckdb.sql(f"""select * from read_parquet({files});""")