sql performance postgresql indexing postgresql-performance

Optimize performance for queries on recent rows of a large table

I have a large table:

CREATE TABLE "orders" (
"id" serial NOT NULL,
"person_id" int4,
"created" int4,
CONSTRAINT "orders_pkey" PRIMARY KEY ("id")
);

90% of all requests are about orders from the last 2-3 days by a person_id, like:

select * from orders
where person_id = 1
and created >= extract(epoch from current_timestamp)::int - 60 * 60 * 24 * 3;

How can I improve performance?

I know about Partitioning, but what about existing rows? And it looks like I need to create INHERITS tables manually every 2-3 days.

Solution

A multicolumn index on (person_id, created) (index fields in this order!) should be good enough for most setups.

Would table partitioning work for you? (In Postgres 12+ where the feature finally matured.) Then you might not need the rest of this answer.

Partial index for very big table

B-tree indices are very efficient. Index depth grows very slowly with the size of the table. Cutting off old rows with a partial index typically only pays to cut off the vast majority of rows in a very big table. Benefits: dramatically smaller index size, moderately faster queries utilizing that index. Or substantially faster queries if you don't have enough RAM to cache (required parts of) the full index.

Simple case: filter values transparent

If filter values in queries are transparent for the query planner, it understands that the partial index is applicable for later timestamps at query planning time. A plain partial index with an actually immutable condition (3 or more days back represented by a unix epoch in your case) does the job:

CREATE INDEX orders_created_recent_idx ON orders (person_id, created)
WHERE created >= 1387497600;

1387497600 being the result of:

SELECT extract(epoch FROM CURRENT_DATE - 3)::int;

Needs to be recreated with a bigger value from time to time to cut off added, aging rows and keep the performance up.

NOTE: CURRENT_DATE depends on the timezone setting of the session. To base the whole operation (incl. below function) on a given timezone use instead:

SELECT extract(epoch FROM date_trunc('day', now(), 'UTC'))::int;

Replace "UTC" with your desired time zone name. date_trunc() taking the time zone as 3rd parameter requires Postgres 12+.

Postgres can infer that the partial index is applicable for filters with newer (greater) timestamps. You'll see the index utilized.

Not so simple case: filter values not transparent

If filter values in queries are not transparent for the query planner, we need to tell Postgres the index is applicable - by adding a fake-IMMUTABLE condition.

Encapsulate the cut-off in a primitive, IMMUTABLE function:

CREATE OR REPLACE FUNCTION f_orders_idx_start()
  RETURNS int
  LANGUAGE sql IMMUTABLE PARALLEL SAFE COST 1
RETURN 1387497600;  -- replace with current value!

PARALLEL SAFE requires Postgres 10+.
The short "SQL-standard" syntax for CREATE FUNCTION requires Postgres 14+. See:

What does BEGIN ATOMIC mean in a PostgreSQL SQL function / procedure?

Base your partial index on this function:

CREATE OR REPLACE INDEX orders_created_recent_idx ON orders (person_id, created)
WHERE created >= f_orders_idx_start();  -- !

Add the same condition to your queries, even if logically redundant. (It's your responsibility not to do that for filters that would pre-date the partial index.)

SELECT *
FROM   orders
WHERE  person_id = 1
AND    created >= f_orders_idx_start()  -- match partial idx condition
AND    created >= extract(epoch from now())::integer - 259200;  -- actual condition

Now Postgres realizes the partial index is applicable.

Recreate function and index from time to time at low activity periods (possibly with a cron-job?) - using this procedure:

CREATE OR REPLACE PROCEDURE orders_reindex_partial()
  LANGUAGE plpgsql AS
$proc$
DECLARE
   -- 3 days back, starting at 00:00
   _start int := extract(epoch FROM CURRENT_DATE - 3)::int;
-- _start int := extract(epoch FROM date_trunc('day', now(), 'UTC'))::int;
BEGIN       
   IF _start = f_orders_idx_start() THEN
      -- do nothing
   ELSE
      -- Recreate IMMUTABLE function
      EXECUTE format('
         CREATE OR REPLACE FUNCTION f_orders_idx_start()
           RETURNS int
           LANGUAGE sql IMMUTABLE PARALLEL SAFE COST 1
         RETURN %s'
       , _start
      );
      -- Recreate partial index
      REINDEX INDEX orders_created_recent_idx;  -- not allowed inside transaction block for partitioned index
   END IF;    
END
$proc$;

Then, to rebase your index, call (ideally with little to no concurrent load):

CALL orders_reindex_partial();  -- that's all

If you never call this function, performance slowly deteriorates over time with the growing partial index. All queries continue to work wither way.

If you cannot afford dropping and recreating the index due to concurrent load, consider REINDEX CONCURRENTLY in Postgres 12+. But that cannot run inside a transaction context, so also not inside a function or procedure:

REINDEX CONCURRENTLY INDEX orders_created_recent_idx;