Tags: postgresql, amazon-rds, failover, debezium, logical-replication

Debezium with RDS postgres and master-replica failover


I have an RDS multi-AZ Postgres database (primary-standby) and I'm investigating Debezium to stream changes to Kafka.

I was reading the documentation about what can go wrong in case of failover (https://debezium.io/documentation/reference/1.1/connectors/postgresql.html#_cluster_failures), and it describes a pretty scary scenario.

From some tests I ran by rebooting with failover, it seems that when the endpoint switches from the primary to the standby, the Debezium connector keeps working and automatically creates a replication slot on the standby. But from what I understand, there is no guarantee against data loss unless you can ensure that the replication slot on the new primary (the old standby) is created before any new data is written.
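
For reference, after a failover I check whether the connector's slot exists on the new primary with something like this (the slot name debezium and the pgoutput plugin are just the defaults I use; adjust to your connector's slot.name / plugin.name settings):

```sql
-- Does the connector's logical slot exist on the new primary?
SELECT slot_name, plugin, active, restart_lsn
FROM pg_replication_slots
WHERE slot_name = 'debezium';

-- If it is missing, create it before any new data is written,
-- e.g. for the pgoutput plugin:
SELECT pg_create_logical_replication_slot('debezium', 'pgoutput');
```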

Does anybody have experience with this setup? How do you manage things in case of failover?


Solution

  • As of 2022, there are some new advances in this space using Patroni, described in the Percona blog post: How Patroni Addresses the Problem of the Logical Replication Slot Failover in a PostgreSQL Cluster

    The gist of the approach from the above:

    • This solution requires PostgreSQL 11 or above because it relies on the pg_replication_slot_advance() function (introduced in PostgreSQL 11) to advance the slot.
    • The downstream connection can go through HAProxy so that it is automatically routed to the primary (not covered in this post). No modification to PostgreSQL code or creation of any extension is required.
    • The slot is copied over the PostgreSQL protocol (libpq) rather than with any OS-specific tools/methods. Patroni uses the rewind or superuser credentials and reads the slot information with the pg_read_binary_file() function.
    • Once the logical slot is created on the replica side, Patroni uses pg_replication_slot_advance() to move the slot forward (see the sketch after this list).
    • The permanent slot information is added to the DCS and is continuously maintained by the primary Patroni instance. A new DCS key named “status” is introduced and supported across all DCS options (ZooKeeper, etcd, Consul, etc.).
    • hot_standby_feedback must be enabled on all standby nodes where the logical replication slot needs to be maintained.
    • The Patroni parameter postgresql.use_slots must be enabled so that every standby node uses a slot on the primary node.
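
    Not from the blog itself, but a minimal sketch of how you can verify these settings and of what the handover boils down to at the SQL level. The slot name debezium and the LSN are illustrative, and in practice Patroni performs the copy-and-advance internally:

    ```sql
    -- On each standby that should keep the logical slot usable after promotion:
    SHOW hot_standby_feedback;   -- must return "on"

    -- On the primary: with postgresql.use_slots enabled, every standby streams
    -- from its own physical slot, and the Debezium logical slot is also listed
    SELECT slot_name, slot_type, plugin, active, restart_lsn
    FROM pg_replication_slots;

    -- What the handover amounts to on a newly promoted node: the logical slot
    -- exists (copied by Patroni) and is advanced to the position the old
    -- primary had reached (the LSN shown here is illustrative)
    SELECT pg_replication_slot_advance('debezium', '0/3000060'::pg_lsn);
    ```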