Our logical replication has started to fail with a lot of these errors on the subscriber:
ERROR: could not create replication slot
"pg_148211637_sync_148178452_7161337762731575223": ERROR: all replication slots are in use
May 2 07:43:01 psql-06 postgres[1195903]: [2] HINT: Free one or increase max_replication_slots.
May 2 07:43:01 psql-06 postgres[1181479]: [1] LOG: background worker "logical replication worker" (PID 1195903) exited with exit code 1
May 2 07:43:01 psql-06 postgres[1195905]: [1] LOG: logical replication table synchronization worker for subscription "sub_production", table "products" has started
May 2 07:43:01 psql-06 postgres[1195901]: [1] LOG: could not drop replication slot "pg_148211637_sync_148183118_7161337762731575223" on publisher: ERROR: replication slot "pg_148211637_sync_148183118_7161337762731575223" does not exist
It seems like it's constantly trying drop the replication slot pg_148211637_sync_148183118_7161337762731575223
on the publisher, but somehow it does not exists?
How can such a situation occur, and how do I resolve it/resume the replication?
Connect to the primary and look for abandoned replication slots there:
SELECT slot_name, slot_type, restart_lsn
FROM pg_replication_slots
WHERE NOT active;
Any of these that don't belong to a standby server that's still needed should go:
SELECT pg_drop_replication_slot('name of the slot');
If you need more than 10 replication slots, increase max_replication_slots
on the primary and restart it.
It is hard to say how you got into trouble, but probably like this: whenever you say CREATE SUBSCRIPTION
, the standby connects to the primary and creates a logical replication slot (unless you explicitly tell it not to). If then something goes wrong with creating the subscription, of if replication breaks, the replication slot is still there on the primary. So perhaps it took many failed attempts until you got logical replication running, and you neglecting cleaning up after failed attempts.