We are trying to optimize a query to a partitioned table, the query looks something like this:
SELECT col1, col2
FROM partitioned_table
WHERE profile_id = '00000000-0000-0000-0000-000000000000'
AND product_id = 'product_a'
ORDER BY created_at DESC
LIMIT 500;
Parent/partitioned table definition:
CREATE TABLE public.partitioned_table (
trade_id integer NOT NULL,
product_id character varying NOT NULL,
settled boolean DEFAULT false NOT NULL,
user_id public.mongo_id NOT NULL,
profile_id uuid NOT NULL,
created_at timestamp with time zone NOT NULL
)
PARTITION BY RANGE (created_at);
Index used to scan:
CREATE INDEX partitioned_profile_id_product_id_trade_id_idx ON ONLY public.partitioned_table USING btree (profile_id, product_id, trade_id) INCLUDE (created_at);
The definition is on the partitioned table, partitions themselves were created after the index is added to the partitioned table, so they have the same set of indexes.
Each partition contains a day worth of data, about 12 million rows.
We run Postgres 14.5 on AWS RDS.
This is the query plan:
QUERY PLAN
----------------------------------------------------------------------------------------------
Limit (cost=944.59..945.84 rows=500 width=202) (actual time=39.501..39.691 rows=500 loops=1)
-> Sort (cost=944.59..947.09 rows=1000 width=202) (actual time=39.499..39.660 rows=500 loops=1)
Sort Key: partitioned_table.created_at DESC
Sort Method: top-N heapsort Memory: 290kB
-> Append (cost=0.71..894.76 rows=1000 width=202) (actual time=0.030..27.204 rows=32867 loops=1)
-> Index Scan using partitioned_table_profile_id_product_id_trade_id_idx on partitioned_table_legacy partitioned_table_1 (cost=0.71..772.99 rows=379 width=116) (actual time=0.029..22.550 rows=32838 loops=1)
Index Cond: ((profile_id = '00000000-0000-0000-0000-000000000000'::uuid) AND ((product_id)::text = 'product_a'::text))
-> Index Scan using partition_20240601_profile_id_product_id_trade_id_created_idx on partition_20240601 partitioned_table_2 (cost=0.56..12.65 rows=5 width=117) (actual time=0.019..0.019 rows=0 loops=1)
Index Cond: ((profile_id = '00000000-0000-0000-0000-000000000000'::uuid) AND ((product_id)::text = 'product_a'::text))
-> Index Scan using partition_20240602_profile_id_product_id_trade_id_created_idx on partition_20240602 partitioned_table_3 (cost=0.56..12.65 rows=5 width=117) (actual time=0.011..0.011 rows=0 loops=1)
Index Cond: ((profile_id = '00000000-0000-0000-0000-000000000000'::uuid) AND ((product_id)::text = 'product_a'::text))
-> Index Scan using partition_20240603_profile_id_product_id_trade_id_created_idx on partition_20240603 partitioned_table_4 (cost=0.56..18.68 rows=8 width=117) (actual time=0.014..0.017 rows=3 loops=1)
Index Cond: ((profile_id = '00000000-0000-0000-0000-000000000000'::uuid) AND ((product_id)::text = 'product_a'::text))
-> Index Scan using partition_20240604_profile_id_product_id_trade_id_created_idx on partition_20240604 partitioned_table_5 (cost=0.56..4.58 rows=1 width=117) (actual time=0.013..0.014 rows=2 loops=1)
Index Cond: ((profile_id = '00000000-0000-0000-0000-000000000000'::uuid) AND ((product_id)::text = 'product_a'::text))
-> Index Scan using partition_20240605_profile_id_product_id_trade_id_created_idx on partition_20240605 partitioned_table_6 (cost=0.56..16.66 rows=7 width=117) (actual time=0.020..0.021 rows=2 loops=1)
Index Cond: ((profile_id = '00000000-0000-0000-0000-000000000000'::uuid) AND ((product_id)::text = 'product_a'::text))
-> Index Scan using partition_20240606_profile_id_product_id_trade_id_created_idx on partition_20240606 partitioned_table_7 (cost=0.56..14.67 rows=6 width=117) (actual time=0.013..0.014 rows=1 loops=1)
Index Cond: ((profile_id = '00000000-0000-0000-0000-000000000000'::uuid) AND ((product_id)::text = 'product_a'::text))
-> Index Scan using partition_20240607_profile_id_product_id_trade_id_created_idx on partition_20240607 partitioned_table_8 (cost=0.56..36.90 rows=17 width=117) (actual time=0.015..0.037 rows=21 loops=1)
Index Cond: ((profile_id = '00000000-0000-0000-0000-000000000000'::uuid) AND ((product_id)::text = 'product_a'::text))
-> Seq Scan on partition_20240608 partitioned_table_9 (cost=0.00..0.00 rows=1 width=265) (actual time=0.014..0.015 rows=0 loops=1)
Filter: ((profile_id = '00000000-0000-0000-0000-000000000000'::uuid) AND ((product_id)::text = 'product_a'::text))
-> Seq Scan on partition_20240609 partitioned_table_10 (cost=0.00..0.00 rows=1 width=265) (actual time=0.004..0.004 rows=0 loops=1)
Filter: ((profile_id = '00000000-0000-0000-0000-000000000000'::uuid) AND ((product_id)::text = 'product_a'::text))
...
Query and query plan obfuscated, and the query plan continues on and does sequential scan for all future/empty partitions.
Upon observing the query plan I have two questions:
Although we specified ORDER BY created_at DESC
, query plan still scans the partitions in forward chronological order, can it be reversed since it is a backward sort ?
We pro-actively created two years worth of future partitions to reduce operational cost. However, since this query does not have the partitioning column created_at
in the WHERE
clause, it is scanning all the future/empty partitions even after enough records is fetched, basically ignoring the LIMIT
clause. How to make it stop scanning when it finds enough records?
I've been mostly reading docs, have not been able to find much insight.
We want a plan with Merge Append
, and partitions listed in (descending) order of their partition key, and Postgres stopping the scan as soon as LIMIT
is satisfied. Like here:
But we actually see Append
→ Sort
→ Limit
The release notes for Postgres 15 have this juicy item:
Allow ordered scans of partitions to avoid sorting in more cases (David Rowley)
Previously, a partitioned table with a
DEFAULT
partition or aLIST
partition containing multiple values could not be used for ordered partition scans. Now they can be used if such partitions are pruned during planning.
(There were many other improvements, so upgrading to a current version should help in any case!)
Indeed, it would seem you have such a default partition:
-> Index Scan using partitioned_table_profile_id_product_id_trade_id_idx on partitioned_table_legacy partitioned_table_1 (cost=0.71..772.99 rows=379 width=116) (actual time=0.029..22.550 rows=32838 loops=1)
Bold emphasis mine.
You don't tell us, but the deviating table name indicates as much.
You would still have to exclude the default partition in current Postgres to make it work. (And merge it manually.) But we have an explanation why it does not work in the first place.
The second item of interest here is the bad estimate (not the core problem). Postgres expects 379
rows, but finds 32838
. Estimates for combined filters are notoriously hard, but that's still bad.
Are your column statistics up to date? Run ANALYZE partitioned_table_legacy
and test again.
Increasing the STATISTIC
target might help. See:
But for combined filters, you may really need extended statistics like:
CREATE STATISTICS (mcv) ON profile_id, product_id FROM public.partitioned_table;
Related:
If all else fails, you have to do it manually. I would write a PL/pgSQL function that iterates through partitions with dynamic SQL and adaptive LIMIT
. That's beyond my limits for unpaid work. Related:
Once you achieve an ordered partition scan (with upgraded Postgres and/or improved query) this index would be ideal:
CREATE INDEX ON public.partitioned_table (profile_id, product_id, created_at DESC);
If col1, col2
are small columns, you might attach them to the index with an INCLUDE
clause to allow index-only scans.
Do not include the ONLY
keyword at this time, since we want that index created for every existing partition, too.
it is scanning all the future/empty partitions even after enough records is fetched,
Your query says ORDER BY created_at DESC
. So those future partitions come first and have to be scanned in any case. If you want to exclude future rows, add a WHERE
clause accordingly.