In a Postgres database, I am querying distinct values of MY_DATE
in a large table with 300 million rows. There are about 400 of them and the column MY_DATE
is indexed.
Select distinct MY_DATE from MY_TABLE;
The query runs for 22 min.
The same query on my Oracle DB, with the exact same data set and the same index definition, runs in 11 seconds.
The query plan shows that the query is using the index:
EXPLAIN Select distinct MY_DATE from MY_TABLE LIMIT 200;
gives:
QUERY PLAN
Limit (cost=0.57..7171644.14 rows=200 width=8)
-> Unique (cost=0.57..15419034.24 rows=430 width=8)
-> Index Only Scan using idx_obsdate on my_table (cost=0.57..14672064.14 rows=298788038 width=8)
When I limit the results, the query can become much faster, e.g.
Select distinct MY_DATE from MY_TABLE LIMIT 5;
runs in sub-seconds.
but:
Select distinct MY_DATE from MY_TABLE LIMIT 50;
already takes minutes. Time seems to increase exponentially with the LIMIT
clause.
I expect the Postgres query to run in seconds, as my Oracle DB does. 20 minutes for an index scan - even for a large table - seems way off the mark.
Any suggestions what causes the issue and what I can do?
distinct values ... 300 million rows ... about 400 of them ... column ... indexed.
There are much faster techniques for this. Emulating a loose index scan (a.k.a. skip scan), and assuming my_date
is defined NOT NULL
(or we can ignore NULL values):
WITH RECURSIVE cte AS (
   SELECT min(my_date) AS my_date
   FROM   my_table
   UNION  ALL
   SELECT (SELECT my_date
           FROM   my_table
           WHERE  my_date > cte.my_date
           ORDER  BY my_date
           LIMIT  1)
   FROM   cte
   WHERE  my_date IS NOT NULL
   )
SELECT my_date
FROM   cte
WHERE  my_date IS NOT NULL;  -- the last iteration adds a terminating NULL row; filter it out
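If my_date is nullable and NULL values should appear in the result, a sketch of one way to adapt the query (the trailing EXISTS probe is my addition, not part of the query above; min() already skips NULLs in the anchor):

```sql
WITH RECURSIVE cte AS (
   SELECT min(my_date) AS my_date   -- min() ignores NULL values
   FROM   my_table
   UNION  ALL
   SELECT (SELECT my_date
           FROM   my_table
           WHERE  my_date > cte.my_date
           ORDER  BY my_date
           LIMIT  1)
   FROM   cte
   WHERE  my_date IS NOT NULL
   )
SELECT my_date
FROM   cte
WHERE  my_date IS NOT NULL       -- drop the terminating NULL row
UNION  ALL                       -- append one NULL row if any NULLs exist
SELECT NULL::date                -- adjust the cast to the actual column type
WHERE  EXISTS (SELECT FROM my_table WHERE my_date IS NULL);
```

The EXISTS probe costs one extra index lookup at most, so the overall pattern stays at roughly one index probe per distinct value.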
Using the index you mentioned, it should finish in milliseconds.
Oracle DB ... 11 seconds.
Because Oracle has native index skip scans and Postgres does not. There are ongoing efforts to implement similar functionality in Postgres.
Currently (still true for Postgres 16), while the index is used to good effect, even in an index-only scan, Postgres cannot skip ahead and has to read index tuples in sequence. Without LIMIT
, the complete index has to be scanned. Hence we see in your EXPLAIN
output:
Index Only Scan ... rows=298788038
The suggested new query achieves the same result while reading only ~400 index tuples (one per distinct value). Big difference.
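To see the difference in actual work done, EXPLAIN (ANALYZE, BUFFERS) can be prepended to either query; the buffer counts make it visible (exact numbers will vary with your data and cache state):

```sql
-- The DISTINCT query reads the whole index:
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT my_date FROM my_table;

-- Prepend the same EXPLAIN line to the recursive query above to compare;
-- it should report only a few hundred buffer hits instead of the full index.
```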
With LIMIT (and no ORDER BY!), as you tested, Postgres stops as soon as enough rows are retrieved. Increasing the limit has a linear effect. But if the number of rows per distinct value varies, so does the added cost.