sql · postgresql · indexing · query-optimization · postgresql-performance

Compound index with date function doesn't allow index-only scans?


I'm trying to use a compound (multicolumn) index on a table to assist in creating daily report counts. I am using Postgres 13, and my table looks like this:

CREATE TABLE inquiries (
    id bigint NOT NULL,
    identity_id bigint NOT NULL,
    received_at timestamp(0) without time zone NOT NULL,
    purpose_id bigint NOT NULL,
    location_id bigint NOT NULL
);

CREATE INDEX "inquiries_DATE_index" ON inquiries USING btree
   (date(received_at), location_id, purpose_id, identity_id);

My query looks like this:

SELECT DATE(received_at), location_id, purpose_id, COUNT(DISTINCT identity_id)
FROM inquiries
WHERE (DATE(received_at) >= $1)
  AND (DATE(received_at) <= $2)
GROUP BY 1, 2, 3

Explain output looks like this:

GroupAggregate  (cost=43703.28..45785.49 rows=10950 width=19)
  Group Key: (date(received_at)), location_id, purpose_id
  ->  Sort  (cost=43703.28..44092.34 rows=155627 width=16)
        Sort Key: (date(received_at)), location_id, purpose_id
        ->  Bitmap Heap Scan on inquiries  (cost=5243.60..27622.21 rows=155627 width=16)
              Recheck Cond: ((date(received_at) >= '2023-11-01'::date) AND (date(received_at) <= '2023-11-30'::date))
              ->  Bitmap Index Scan on "inquiries_DATE_index"  (cost=0.00..5204.70 rows=155627 width=0)
                    Index Cond: ((date(received_at) >= '2023-11-01'::date) AND (date(received_at) <= '2023-11-30'::date))

The index doesn't seem to help, and the query takes a long time to execute. If I add a date column to the table and use that instead of date(received_at), the query performs better and the plan changes to:

GroupAggregate  (cost=0.43..85199.58 rows=10980 width=19)
  Group Key: pacific_date, location_id, purpose_id
  ->  Index Only Scan using inquiries_pacific_date_index on inquiries  (cost=0.43..77813.12 rows=727666 width=16)
        Index Cond: ((pacific_date >= '2023-11-01'::date) AND (pacific_date <= '2023-11-30'::date))

I guess I can do this if I can't find a better way but it seems kind of redundant. Is there a way I can write my original query so it makes better use of the index?


Solution

  • Drop-in fix

    As Laurenz explained, index-only scans currently (as of Postgres 16) suffer from a corner-case limitation in Postgres. The manual:

    However, PostgreSQL's planner is currently not very smart about such cases. It considers a query to be potentially executable by index-only scan only when all columns needed by the query are available from the index.

    The manual has more details. One workaround is to "include" the column itself in the index (replacing the old one):

    CREATE INDEX inquiries_date_plus_idx ON inquiries
       (date(received_at), location_id, purpose_id, identity_id) INCLUDE (received_at);
    

    This allows index-only scans for your original query. But it also increases the size of the index - by 8 bytes per row in your case.
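
    To verify that the covering index actually enables an index-only scan, check the plan after a `VACUUM` - index-only scans depend on an up-to-date visibility map (the date range below is just an example):

    VACUUM ANALYZE inquiries;

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT date(received_at), location_id, purpose_id, count(DISTINCT identity_id)
    FROM   inquiries
    WHERE  date(received_at) >= '2023-11-01'
    AND    date(received_at) <= '2023-11-30'
    GROUP  BY 1, 2, 3;

    On a mostly-static table you should see "Index Only Scan" with a low "Heap Fetches" count; on a heavily updated table, heap fetches creep back in until the next vacuum.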

    fiddle

    Better

    Create an index on bare columns, without expressions:

    CREATE INDEX inquiries_received_at_plus_idx ON inquiries
       (received_at, location_id, purpose_id, identity_id);
    

    (A plain index also often has additional utility for other purposes.)
    And adjust your query slightly, to be exactly equivalent:

    SELECT received_at::date, location_id, purpose_id, COUNT(DISTINCT identity_id)
    FROM   inquiries
    WHERE  received_at >= $1
    AND    received_at <  $2 + 1  -- !
    GROUP  BY 1, 2, 3;
    

    Parameters $1 and $2 must be of type date, and received_at of type timestamp, as indicated in the question.
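
    If the parameters arrive untyped (for example, as text from a driver), explicit casts make the types unambiguous - a sketch of the same query:

    SELECT received_at::date, location_id, purpose_id, count(DISTINCT identity_id)
    FROM   inquiries
    WHERE  received_at >= $1::date
    AND    received_at <  $2::date + 1  -- date + integer yields the next day
    GROUP  BY 1, 2, 3;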

    count(DISTINCT col) is typically slow in my experience. This may be faster still:

    SELECT received_at::date, location_id, purpose_id, count(*) AS dist_identities
    FROM  (
       SELECT DISTINCT ON (1,2,3,4)
              received_at::date, location_id, purpose_id, identity_id
       FROM   inquiries
       WHERE  received_at >= $1
       AND    received_at <  $2 + 1
       ) sub
    GROUP  BY 1, 2, 3;
    

    fiddle

    And if there are many duplicates per (received_at::date, location_id, purpose_id, identity_id), an emulated index-skip scan may be much faster still. See:
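
    The emulated skip scan can be sketched with a recursive CTE along these lines (untested sketch; it walks from one distinct 4-column combination to the next with a row-wise comparison, so each iteration fetches a single row instead of scanning all duplicates - an expression index led by received_at::date would serve it best):

    WITH RECURSIVE skip AS (
       (  -- anchor: first distinct combination in the date range
       SELECT received_at::date AS day, location_id, purpose_id, identity_id
       FROM   inquiries
       WHERE  received_at >= $1
       AND    received_at <  $2 + 1
       ORDER  BY 1, 2, 3, 4
       LIMIT  1
       )
       UNION ALL
       SELECT n.*
       FROM   skip s
       CROSS  JOIN LATERAL (  -- next distinct combination after the current one
          SELECT i.received_at::date AS day, i.location_id, i.purpose_id, i.identity_id
          FROM   inquiries i
          WHERE  i.received_at >= $1
          AND    i.received_at <  $2 + 1
          AND    (i.received_at::date, i.location_id, i.purpose_id, i.identity_id)
               > (s.day, s.location_id, s.purpose_id, s.identity_id)
          ORDER  BY 1, 2, 3, 4
          LIMIT  1
          ) n
       )
    SELECT day, location_id, purpose_id, count(*) AS dist_identities
    FROM   skip
    GROUP  BY 1, 2, 3;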

    Upgrade

    Postgres has improved performance for large data sets with every major version in recent years. Consider upgrading to the latest version - Postgres 16 at the time of writing. That should give you an immediate, additional boost.