I have the following query that uses the psql SKIP LOCKED
feature to poll messages from a queue table sorted by the insertion time. My subquery also limits the result set by 10 since that is the maximum number of results that can be processed by the consumer.
DELETE FROM outbox_queue
WHERE id IN (
SELECT id FROM outbox_queue
WHERE delay_until IS NULL OR now() > delay_until
ORDER BY enqueued_on, id
LIMIT 10
FOR UPDATE SKIP LOCKED
)
RETURNING *
Under normal load, everything works as expected. However when I apply load tests to my system, I will occasionally (<5%) get results the range from dozens of records to even several hundred records being returned. This is confusing to me since my understanding is that the subquery should have a limit placed on the number of records returned.
I've checked my logs and verified that the records being pulled are indeed unique and are not duplicates. Once load is reduced, the behavior returns to what I would expect.
My application has three workers that poll messages from this table, and each worker has a single thread that runs this query and processes the result.
Postgres version:
PostgreSQL 10.18 on x86_64-pc-linux-gnu, compiled by x86_64-pc-linux-gnu-gcc (GCC) 7.4.0, 64-bit
What am I missing?
Edit: Also including query plans for my old query posted in problem, as well as query plan for Erwin's answer.
Original query plan:
Delete on outbox_queue (cost=1.87..85.97 rows=10 width=38) (actual time=1.324..1.358 rows=10 loops=1)
-> Nested Loop (cost=1.87..85.97 rows=10 width=38) (actual time=1.316..1.339 rows=10 loops=1)
-> HashAggregate (cost=1.44..1.54 rows=10 width=40) (actual time=1.297..1.300 rows=10 loops=1)
" Group Key: ""ANY_subquery"".id"
" -> Subquery Scan on ""ANY_subquery"" (cost=0.42..1.42 rows=10 width=40) (actual time=0.034..1.289 rows=10 loops=1)"
-> Limit (cost=0.42..1.32 rows=10 width=22) (actual time=0.030..1.280 rows=10 loops=1)
-> LockRows (cost=0.42..90221.41 rows=1009865 width=22) (actual time=0.030..1.277 rows=10 loops=1)
-> Index Scan using outbox_queue__enqueued_on on outbox_queue outbox_queue_1 (cost=0.42..80122.76 rows=1009865 width=22) (actual time=0.023..1.256 rows=10 loops=1)
Filter: ((delay_until IS NULL) OR (now() > delay_until))
-> Index Scan using outbox_queue_pkey on outbox_queue (cost=0.42..8.44 rows=1 width=14) (actual time=0.003..0.003 rows=1 loops=10)
" Index Cond: (id = ""ANY_subquery"".id)"
Planning time: 0.230 ms
Execution time: 1.407 ms
Query plan for Erwin's answer. As you can see, the LIMIT is now being evaluated outside the nested loop, whereas previously it was being evaluated inside.
Delete on outbox_queue (cost=1.97..86.07 rows=10 width=38) (actual time=0.065..0.094 rows=10 loops=1)
CTE sel
-> Limit (cost=0.42..1.32 rows=10 width=22) (actual time=0.030..0.039 rows=10 loops=1)
-> LockRows (cost=0.42..90221.41 rows=1009865 width=22) (actual time=0.029..0.037 rows=10 loops=1)
-> Index Scan using outbox_queue__enqueued_on on outbox_queue outbox_queue_1 (cost=0.42..80122.76 rows=1009865 width=22) (actual time=0.024..0.027 rows=10 loops=1)
Filter: ((delay_until >= now()) IS NOT TRUE)
-> Nested Loop (cost=0.65..84.75 rows=10 width=38) (actual time=0.061..0.080 rows=10 loops=1)
-> HashAggregate (cost=0.23..0.33 rows=10 width=40) (actual time=0.053..0.056 rows=10 loops=1)
Group Key: sel.id
-> CTE Scan on sel (cost=0.00..0.20 rows=10 width=40) (actual time=0.034..0.049 rows=10 loops=1)
-> Index Scan using outbox_queue_pkey on outbox_queue (cost=0.42..8.44 rows=1 width=14) (actual time=0.002..0.002 rows=1 loops=10)
Index Cond: (id = sel.id)
Planning time: 0.153 ms
Execution time: 0.174 ms
And here is my schema for the outbox_queue
table.
create table if not exists outbox_queue (
id bigserial not null primary key,
message_type varchar not null,
body varchar not null,
enqueued_on timestamptz not null,
delay_until timestamptz,
attributes jsonb not null
);
create index if not exists outbox_queue__enqueued_on on outbox_queue(enqueued_on);
create index if not exists outbox_queue__delay_until on outbox_queue(delay_until);
The query plan for this particular combination can result in a Nested Loop
, which can exceed the LIMIT
set in the subquery. See:
Rows are locked after they have been returned from the subquery, and running this in a nested loop in combination with SKIP LOCKED
can sidestep the LIMIT
. I have not fully verified the exact mechanism to this day. Also, this may have been fixed in recent versions. Didn't rerun tests, yet.
Materializing the selection in a CTE reliably prevents the problem in any case:
WITH sel AS (
SELECT id
FROM outbox_queue
WHERE delay_until >= now() IS NOT TRUE -- simpler & may be faster
-- WHERE delay_until IS NULL OR now() > delay_until
ORDER BY enqueued_on, id
LIMIT 10
FOR UPDATE SKIP LOCKED
)
DELETE FROM outbox_queue
WHERE id IN (TABLE sel)
RETURNING *
While being at it I replaced:
delay_until IS NULL OR now() > delay_until
with:
delay_until >= now() IS NOT TRUE
Equivalent. Simpler. May be faster.
Postgres 12 added the MATERIALIZED
option. Explicitly materialize to be sure:
WITH sel AS MATERIALIZED ( -- !
SELECT id FROM outbox_queue
WHERE delay_until >= now() IS NOT TRUE
ORDER BY enqueued_on, id
LIMIT 10
FOR UPDATE SKIP LOCKED
)
DELETE FROM outbox_queue
WHERE id IN (TABLE sel)
RETURNING *
By default, a side-effect-free
WITH
query is folded into the primary query if it is used exactly once in the primary query'sFROM
clause. This allows joint optimization of the two query levels in situations where that should be semantically invisible. However, such folding can be prevented by marking theWITH
query asMATERIALIZED
. That might be useful, for example, if theWITH
query is being used as an optimization fence to prevent the planner from choosing a bad plan. PostgreSQL versions before v12 never did such folding, so queries written for older versions might rely onWITH
to act as an optimization fence.