sqlpostgresqlamazon-aurorapostgresql-13

Postgres scans all partitions on ranged query


I have a partitioned table where I store order data of different customers, which I created like that (PG 13.8 AWS RDS Aurora):

CREATE TABLE order_with_returns PARTITION BY RANGE ("date");

And I created partitions like that

DO $$
DECLARE
    partition_date DATE := DATE('2021-01-01');
    partition_month VARCHAR := '';
    values_from VARCHAR := '';
    values_to VARCHAR := '';
BEGIN
FOR i IN 1..50 LOOP
        partition_month := REPLACE(SUBSTRING(CAST(partition_date AS VARCHAR), 1, 7), '-', '_');
        values_from := TO_TIMESTAMP(CAST(partition_date as VARCHAR), 'YYYY-MM-DD 00:00:00');
        partition_date := partition_date + '1 month'::INTERVAL;
        values_to := TO_TIMESTAMP(CAST(partition_date as VARCHAR), 'YYYY-MM-DD 00:00:00');
EXECUTE FORMAT(
            '
                CREATE TABLE order_with_returns_%s PARTITION OF order_with_returns
                FOR VALUES FROM (''%s'') TO (''%s'');
            ',
            partition_month, values_from, values_to
        );
END LOOP;
END;
$$;

And finally, I added some indexes

CREATE UNIQUE INDEX order_with_returns_pkey ON order_with_returns USING btree (id, "date");
CREATE UNIQUE INDEX unique_owrv_idx ON order_with_returns USING btree (order_id, license_id, date, address_id, order_item_id);

Basically I tried to partition the table by month and then inserted all the data (in the order of 10^9 rows). The typical query that is executed is filtering on license_id and some date range. E.g.

WITH filtered_order_with_returns AS (
    SELECT 
        ...
    FROM order_with_returns or
    WHERE license_id = 'abc' AND (date::date BETWEEN '2023-10-01'::date AND '2023-10-31'::date) 
)
SELECT ... <bunch of aggregations on data from filtered_order_with_returns CTE> ...

What I would expect the query planer to do, to only scan stuff from the October partition but instead I see this:

  CTE filtered_order_with_returns
  ->  Gather  (cost=1000.14..6218934.03 rows=332 width=181) (actual time=33477.287..34017.848 rows=1430 loops=1)
      Workers Planned: 2
      Workers Launched: 2
      ->  Parallel Append  (cost=0.14..6217900.83 rows=148 width=181) (actual time=22617.904..22975.694 rows=477 loops=3)
            ->  Parallel Index Scan using order_with_returns_order_id_license_id_date_add_idx36 on order_with_returns_2024_01 orv_37  (cost=0.14..9.13 rows=1 width=408) (actual time=0.004..0.004 rows=0 loops=1)
                  Index Cond: (license_id = 'abc'::text)
                  Filter: (((date)::date >= '2023-10-01'::date) AND ((date)::date <= '2023-10-31'::date))
            ->  Parallel Index Scan using order_with_returns_order_id_license_id_date_add_idx37 on order_with_returns_2024_02 orv_38  (cost=0.14..9.13 rows=1 width=408) (actual time=0.003..0.003 rows=0 loops=1)
                  Index Cond: (license_id = 'abc'::text)
                  Filter: (((date)::date >= '2023-10-01'::date) AND ((date)::date <= '2023-10-31'::date))
... <scans through all partitions>

This is not the behavior I hoped for and especially in terms of performance (> 30s) still very disappointing. I want to understand why the planner is not smart enough to utilize the partitioned nature of this table. What a tried so far in terms of debugging, was

  1. limiting the WHERE filter to one specific date but I still get the same result
  2. trying a SET LOCAL enable_partitionwise_aggregate = on; in the same transaction to see whether this is connected to the aggregation that is performed in the query

Might this be related to the fact that I cast "date" to date and that the "date" column is a timestamptz type of field? Could I help the planner to find the right partition by adding constraint checks to each partition?


Solution

  • date is not the same as date::date, and since you partitioned by date, PostgreSQL cannot use partition pruning. Simple as that.

    Two suggestions:

    1. Try this equivalent condition:

      SELECT ...
      FROM order_with_returns
      WHERE license_id = 'abc'
        AND date >= '2023-10-01 00:00:00'
        AND date < '2023-11-01 00:00:00'
      

      (If you are using timestamp with time zone, you may want to add a timezone.)

    2. Don't name columns date if they aren't, and don't use SQL keywords as identifiers to begin with.