SQL gurus!
My postgres 15 database is running under the SERIALIZABLE
isolation level and I'm occasionally seeing serialization_failure (40001)
errors with an odd error message:
DETAIL: Reason code: Canceled on identification as a pivot, with conflict out to old committed transaction 39594340.
We have limited automated retries for this class of error but in a recent case, nothing could touch the affected tables for over an hour -- any UPDATE
attempt would fail with the same error.
What's odd to me is that the error cites txn 39594340
, which successfully committed more than 15 minutes before the error messages started showing up in our logs.
This particular variant of error doesn't yield much on Google. Perhaps the most relevant is https://www.mail-archive.com/pgsql-hackers@lists.postgresql.org/msg156458.html which talks about how predicate locks that run out of shared memory (or txns that hit the max_pred_locks_per_transaction
limit) can lose a LOT of granularity, particularly if they swap to disk (even the DB can be lost!?!?)
Anywho in our case, it seems we have a very complex, long-running txn (call it txnA
) that is almost certainly hitting the default max_pred_locks_per_transaction=64
limit. Then we have these other much simpler txns that are basically just executing 1-2 UPDATE
statements (txnB
).
Both the committed txn (old committed transaction 39594340
) and the failing-with-serialization-error txn are of the simple txnB
type, but may be happening concurrently with txnA
. Meanwhile, txnA
isn't touching the same tables as txnB
, but started before both and was running still when we started to see errors.
It seems maybe there's some tangled web of dependencies, related to loss of specificity in the long-running txn's predicate locks. Is that hypothesis crazy?
How else could a committed txn affect a current txn?
TIA for any insights here!
It is hard to say what exactly happened, but the predicate locks taken on the SERIALIZABLE
isolation level are kept around until the last concurrent transaction is gone. So there must have been a long-running transaction involved.