I have a partitioned table where I store order data of different customers, which I created like that (PG 13.8 AWS RDS Aurora):
CREATE TABLE order_with_returns PARTITION BY RANGE ("date");
And I created partitions like that
DO $$
DECLARE
partition_date DATE := DATE('2021-01-01');
partition_month VARCHAR := '';
values_from VARCHAR := '';
values_to VARCHAR := '';
BEGIN
FOR i IN 1..50 LOOP
partition_month := REPLACE(SUBSTRING(CAST(partition_date AS VARCHAR), 1, 7), '-', '_');
values_from := TO_TIMESTAMP(CAST(partition_date as VARCHAR), 'YYYY-MM-DD 00:00:00');
partition_date := partition_date + '1 month'::INTERVAL;
values_to := TO_TIMESTAMP(CAST(partition_date as VARCHAR), 'YYYY-MM-DD 00:00:00');
EXECUTE FORMAT(
'
CREATE TABLE order_with_returns_%s PARTITION OF order_with_returns
FOR VALUES FROM (''%s'') TO (''%s'');
',
partition_month, values_from, values_to
);
END LOOP;
END;
$$;
And finally, I added some indexes
CREATE UNIQUE INDEX order_with_returns_pkey ON order_with_returns USING btree (id, "date");
CREATE UNIQUE INDEX unique_owrv_idx ON order_with_returns USING btree (order_id, license_id, date, address_id, order_item_id);
Basically I tried to partition the table by month and then inserted all the data (in the order of 10^9 rows). The typical query that is executed is filtering on license_id and some date range. E.g.
WITH filtered_order_with_returns AS (
SELECT
...
FROM order_with_returns or
WHERE license_id = 'abc' AND (date::date BETWEEN '2023-10-01'::date AND '2023-10-31'::date)
)
SELECT ... <bunch of aggregations on data from filtered_order_with_returns CTE> ...
What I would expect the query planer to do, to only scan stuff from the October partition but instead I see this:
CTE filtered_order_with_returns
-> Gather (cost=1000.14..6218934.03 rows=332 width=181) (actual time=33477.287..34017.848 rows=1430 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Append (cost=0.14..6217900.83 rows=148 width=181) (actual time=22617.904..22975.694 rows=477 loops=3)
-> Parallel Index Scan using order_with_returns_order_id_license_id_date_add_idx36 on order_with_returns_2024_01 orv_37 (cost=0.14..9.13 rows=1 width=408) (actual time=0.004..0.004 rows=0 loops=1)
Index Cond: (license_id = 'abc'::text)
Filter: (((date)::date >= '2023-10-01'::date) AND ((date)::date <= '2023-10-31'::date))
-> Parallel Index Scan using order_with_returns_order_id_license_id_date_add_idx37 on order_with_returns_2024_02 orv_38 (cost=0.14..9.13 rows=1 width=408) (actual time=0.003..0.003 rows=0 loops=1)
Index Cond: (license_id = 'abc'::text)
Filter: (((date)::date >= '2023-10-01'::date) AND ((date)::date <= '2023-10-31'::date))
... <scans through all partitions>
This is not the behavior I hoped for and especially in terms of performance (> 30s) still very disappointing. I want to understand why the planner is not smart enough to utilize the partitioned nature of this table. What a tried so far in terms of debugging, was
Might this be related to the fact that I cast "date" to date and that the "date" column is a timestamptz type of field? Could I help the planner to find the right partition by adding constraint checks to each partition?
date
is not the same as date::date
, and since you partitioned by date
, PostgreSQL cannot use partition pruning. Simple as that.
Two suggestions:
Try this equivalent condition:
SELECT ...
FROM order_with_returns
WHERE license_id = 'abc'
AND date >= '2023-10-01 00:00:00'
AND date < '2023-11-01 00:00:00'
(If you are using timestamp with time zone
, you may want to add a timezone.)
Don't name columns date
if they aren't, and don't use SQL keywords as identifiers to begin with.