I have a large table:
CREATE TABLE "orders" (
"id" serial NOT NULL,
"person_id" int4,
"created" int4,
CONSTRAINT "orders_pkey" PRIMARY KEY ("id")
);
90% of all requests are about orders from the last 2-3 days by a person_id
, like:
select * from orders
where person_id = 1
and created >= extract(epoch from current_timestamp)::int - 60 * 60 * 24 * 3;
How can I improve performance?
I know about Partitioning, but what about existing rows? And it looks like I need to create INHERITS
tables manually every 2-3 days.
A multicolumn index on (person_id, created)
(index fields in this order!) should be good enough for most setups.
Would table partitioning work for you? (In Postgres 12+ where the feature finally matured.) Then you might not need the rest of this answer.
B-tree indices are very efficient. Index depth grows very slowly with the size of the table. Cutting off old rows with a partial index typically only pays to cut off the vast majority of rows in a very big table. Benefits: dramatically smaller index size, moderately faster queries utilizing that index. Or substantially faster queries if you don't have enough RAM to cache (required parts of) the full index.
If filter values in queries are transparent for the query planner, it understands that the partial index is applicable for later timestamps at query planning time. A plain partial index with an actually immutable condition (3 or more days back represented by a unix epoch in your case) does the job:
CREATE INDEX orders_created_recent_idx ON orders (person_id, created)
WHERE created >= 1387497600;
1387497600
being the result of:
SELECT extract(epoch FROM CURRENT_DATE - 3)::int;
Needs to be recreated with a bigger value from time to time to cut off added, aging rows and keep the performance up.
NOTE: CURRENT_DATE
depends on the timezone
setting of the session.
To base the whole operation (incl. below function) on a given timezone use instead:
SELECT extract(epoch FROM date_trunc('day', now(), 'UTC'))::int;
Replace "UTC" with your desired time zone name. date_trunc()
taking the time zone as 3rd parameter requires Postgres 12+.
Postgres can infer that the partial index is applicable for filters with newer (greater) timestamps. You'll see the index utilized.
If filter values in queries are not transparent for the query planner, we need to tell Postgres the index is applicable - by adding a fake-IMMUTABLE
condition.
Encapsulate the cut-off in a primitive, IMMUTABLE
function:
CREATE OR REPLACE FUNCTION f_orders_idx_start()
RETURNS int
LANGUAGE sql IMMUTABLE PARALLEL SAFE COST 1
RETURN 1387497600; -- replace with current value!
PARALLEL SAFE
requires Postgres 10+.
The short "SQL-standard" syntax for CREATE FUNCTION
requires Postgres 14+. See:
Base your partial index on this function:
CREATE OR REPLACE INDEX orders_created_recent_idx ON orders (person_id, created)
WHERE created >= f_orders_idx_start(); -- !
Add the same condition to your queries, even if logically redundant. (It's your responsibility not to do that for filters that would pre-date the partial index.)
SELECT *
FROM orders
WHERE person_id = 1
AND created >= f_orders_idx_start() -- match partial idx condition
AND created >= extract(epoch from now())::integer - 259200; -- actual condition
Now Postgres realizes the partial index is applicable.
Recreate function and index from time to time at low activity periods (possibly with a cron-job?) - using this procedure:
CREATE OR REPLACE PROCEDURE orders_reindex_partial()
LANGUAGE plpgsql AS
$proc$
DECLARE
-- 3 days back, starting at 00:00
_start int := extract(epoch FROM CURRENT_DATE - 3)::int;
-- _start int := extract(epoch FROM date_trunc('day', now(), 'UTC'))::int;
BEGIN
IF _start = f_orders_idx_start() THEN
-- do nothing
ELSE
-- Recreate IMMUTABLE function
EXECUTE format('
CREATE OR REPLACE FUNCTION f_orders_idx_start()
RETURNS int
LANGUAGE sql IMMUTABLE PARALLEL SAFE COST 1
RETURN %s'
, _start
);
-- Recreate partial index
REINDEX INDEX orders_created_recent_idx; -- not allowed inside transaction block for partitioned index
END IF;
END
$proc$;
Then, to rebase your index, call (ideally with little to no concurrent load):
CALL orders_reindex_partial(); -- that's all
If you never call this function, performance slowly deteriorates over time with the growing partial index. All queries continue to work wither way.
If you cannot afford dropping and recreating the index due to concurrent load, consider REINDEX CONCURRENTLY
in Postgres 12+. But that cannot run inside a transaction context, so also not inside a function or procedure:
REINDEX CONCURRENTLY INDEX orders_created_recent_idx;