parquetduckdb

Querying multiple parquet files in a range using duckdb


I have parquet files arranged in this format

/db/{year}/table{date}.parquet

In each year folder, there are up to 365 files.

If I want to query data from a time range, say the week 2024-04-28 to 2024-05-04, I could use

SELECT
  count(*) as count,
FROM read_parquet('/db/2024/table*.parquet')
WHERE date >= '2024-04-28' and date < '2024-05-05'

But I don't need to read all files in /db/2024/table*.parquet. I know exactly which seven files has the data I need. How do I define this in duckdb? I am using Python. I could do my own filtering and put all the files in a Python list like filenames.


Solution

  • you can pass python list into duckdb query:

    import json
    
    files = json.dumps(["df1.parquet", "df3.parquet"])
    
    duckdb.sql(f"""select * from read_parquet({files});""")