sqlpostgresqlindexingquery-optimizationdistinct

Extremely slow distinct query on indexed column


In a Postgres database, I am querying distinct values of MY_DATE in a large table with 300 million rows. There are about 400 of them and the column MY_DATE is indexed.

Select distinct  MY_DATE from MY_TABLE;

The query runs for 22 min.

The same query on my Oracle DB with the exact same data-set and the same index definition runs 11 seconds.

The query plan shows that the query is using the index:

EXPLAIN Select distinct  MY_DATE from MY_TABLE LIMIT 200;

gives:

QUERY PLAN
Limit  (cost=0.57..7171644.14 rows=200 width=8)
  ->  Unique  (cost=0.57..15419034.24 rows=430 width=8)
        ->  Index Only Scan using idx_obsdate on my_table  (cost=0.57..14672064.14 rows=298788038 width=8)

When I limit the results, the query can become much faster. Ee.g.

Select distinct  MY_DATE from MY_TABLE LIMIT 5;

runs in sub-seconds.

but:

Select distinct  MY_DATE from MY_TABLE LIMIT 50;

already takes minutes. Time seems to increase exponentially with the LIMIT clause.

I expect the Postgres query to run in seconds, as my OracleDB does. 20 minutes for an index scan - even for a large table - seems way off the mark.

Any suggestions what causes the issue and what I can do?


Solution

  • distinct values ... 300 million rows ... about 400 of them ... column ... indexed.

    There are much faster techniques for this. Emulating a loose index scan (a.k.a. skip scan), and assuming my_date is defined NOT NULL (or we can ignore NULL values):

    WITH RECURSIVE cte AS (
       SELECT min(my_date) AS my_date
       FROM   my_table
    
       UNION ALL
       SELECT (SELECT my_date
               FROM   my_table 
               WHERE  my_date > cte.my_date
               ORDER  BY my_date
               LIMIT  1)
       FROM   cte
       WHERE  my_date IS NOT NULL
       )
    TABLE  cte;
    

    Related:

    Using the index you mentioned it should finish in milliseconds.

    Oracle DB ... 11 seconds.

    Because Oracle has native index skip scans and Postgres does not. There are ongoing efforts to implement similar functionality in Postgres.

    Currently (still true for Postgres 16), while the index is used to good effect, even in an index-only scan, Postgres cannot skip ahead and has to read index tuples in sequence. Without LIMIT, the complete index has to be scanned. Hence we see in your EXPLAIN output:

    Index Only Scan ... rows=298788038

    The suggested new query achieves the same with reading 400 index tuples (one per distinct value). Big difference.

    With LIMIT (and no ORDER BY!) like you tested, Postgres stops as soon as enough rows are retrieved. Increasing the limit has a linear effect. But if the number of rows per distinct value can vary, so does the added cost.