In PostgreSQL I have to fetch ids in small batches and delete them at the same time, so that no two clients will fetch the same ids. Sadly, DELETE does not support LIMIT, so I came up with this query:
DELETE FROM codes
WHERE id IN (SELECT id FROM codes WHERE product = 'abc' LIMIT 100 FOR UPDATE)
RETURNING id
Sadly, this is slow and, under certain circumstances, essentially a serial algorithm: to avoid selecting the same id twice, you cannot SELECT while a DELETE is running, and you cannot DELETE until you have the SELECT result. This results in poor performance when many clients come in parallel. What are my best options to parallelize / speed it all up?
PS: When the product is different for different clients, the database parallelizes well, but when it is the same, response time starts to grow linearly with the number of parallel clients.
Assuming you don't care about the order in which codes are consumed - indicated by the absence of ORDER BY in your query.
Use SKIP LOCKED to allow parallelization. Else, multiple concurrent calls regularly pile up, waiting for the previous one(s) to finish.
Also, materialize your selection in a CTE, as the LIMIT clause does not interact well with a locking clause:
WITH sel AS MATERIALIZED (
   SELECT id
   FROM   codes
   WHERE  product = 'abc'
   LIMIT  100
   FOR    UPDATE SKIP LOCKED  -- skip over rows locked by concurrent sessions
   )
DELETE FROM codes c
USING  sel s
WHERE  c.id = s.id
RETURNING c.id;
The MATERIALIZED keyword is not strictly needed, as this CTE isn't going to be inlined anyway. But it won't hurt for clarity.
Now you can run multiple instances of this command, each in a separate session, concurrently and safely.
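If it suits your setup, you can also wrap the batch in a server-side function so each client just makes a single call per batch. A minimal sketch, assuming id is bigint; the function name claim_codes is only a placeholder:

CREATE OR REPLACE FUNCTION claim_codes(_product text, _limit int)
  RETURNS SETOF bigint          -- assumes id is bigint; adjust to the actual type
  LANGUAGE sql AS
$func$
WITH sel AS MATERIALIZED (
   SELECT id
   FROM   codes
   WHERE  product = _product
   LIMIT  _limit
   FOR    UPDATE SKIP LOCKED
   )
DELETE FROM codes c
USING  sel s
WHERE  c.id = s.id
RETURNING c.id;
$func$;

Each client then claims a batch with:

SELECT * FROM claim_codes('abc', 100);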
To make sure all rows have been processed, check after you think you are done. Like:
SELECT EXISTS (SELECT FROM codes WHERE product = 'abc');
If there are still rows left, run a final command without SKIP LOCKED. Loop the last two steps to be absolutely sure. Or just loop the command without SKIP LOCKED.
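A minimal sketch of that cleanup loop, assuming a single session drains the leftovers (note that the whole DO block runs as one transaction):

DO
$do$
DECLARE
   _deleted int;
BEGIN
   LOOP
      WITH sel AS MATERIALIZED (
         SELECT id
         FROM   codes
         WHERE  product = 'abc'
         LIMIT  100
         FOR    UPDATE               -- no SKIP LOCKED: wait for concurrent batches
         )
      DELETE FROM codes c
      USING  sel s
      WHERE  c.id = s.id;

      GET DIAGNOSTICS _deleted = ROW_COUNT;  -- rows deleted by the last statement
      EXIT WHEN _deleted = 0;                -- done when nothing is left
   END LOOP;
END
$do$;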
An index on codes(product) should help if the filter WHERE product = 'abc' is actually selective, i.e. matches only a very small percentage of total rows. Else, it's typically more cost than help, as the index adds write overhead.
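For example (hypothetical index name):

CREATE INDEX codes_product_idx ON codes (product);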