postgresqlindexingsql-execution-planpostgresql-performancepostgres-12

How to make Postgres use an index for a set of values?


I have a table with ~35M rows, and trying to find "processed" records to remove from time to time. There are 14 valid statuses, and 10 of them are processed.

id uuid default uuid_generate_v4() not null primary key,
fk_id uuid not null references fk_table,
-- ... other columns
created_date timestamptz default now() not null,
status varchar(128) not null

Values for status can be one of a,b,c,d,e,f,g,h,i,j,k,l,m,n (14)

The index is on (status,created_date).
A query like:

select id from table
where created_date < 'somedate'
and status = ANY('{a,b,c,d,e,f,g,h,i,j}') -- one of first 10 values

The query planner insists on using a full seq_scan, instead of the index.

Is there a trick to make Postgres use the index for the status = ANY part of the predicate?


Solution

  • If more than a few percent of rows qualify - or rather, if Postgres estimates as much - it will chose a sequential scan, which is faster for such a case.

    If, in fact, only few rows qualify, then your column statistics (and/or cost settings) are to blame for misleading estimates.

    If the cited index is only for purpose at hand, and only relatively few rows have a "processed" state, replace it with a partial index:

    CREATE INDEX foo ON tbl (created_date) WHERE status = ANY('{a,b,c..10}')
    

    Would make the index much smaller, the query faster, and the likelihood it gets used bigger.

    Either way, increasing the statistics target for created_date and status at least by a bit is most probably helpful. See:

    And more aggressive autovacuum settings for the table:

    Either way, for only "14 valid states", status varchar(128) seems tremendously wasteful.

    Also, the planner has gotten smarter for this in Postgres 16 as compared to Postgres 12. Postgres 16 detects common values even in input arrays with many elements (many more than you have distinct values in status), and switches the plan accordingly. Not sure about the old logic in Postgres 12, but for more than ? elements in the array, Postgres used to switch to generic estimates, which can generate poor results.

    But note, that either version can even adapt the plan for prepared statements, based on actual input.

    fiddle -- pg 16
    fiddle -- pg 12

    More depends on missing details ...