Redis HA in Kubernetes cannot fail over back to master


I am trying to create a simple Redis high-availability setup with 1 master, 1 slave and 2 sentinels.

The setup works perfectly when failing over from redis-master to redis-slave. When redis-master recovers, it correctly registers itself as a slave of the newly promoted redis-slave.

However, when redis-slave (now acting as master) goes down, redis-master cannot return to the master role. The redis-master log goes into a loop showing:

1:S 12 Dec 11:12:35.073 * MASTER <-> SLAVE sync started
1:S 12 Dec 11:12:35.073 * Non blocking connect for SYNC fired the event.
1:S 12 Dec 11:12:35.074 * Master replied to PING, replication can continue...
1:S 12 Dec 11:12:35.075 * Trying a partial resynchronization (request 684581a36d134a6d50f1cea32820004a5ccf3b2d:285273).
1:S 12 Dec 11:12:35.076 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master
1:S 12 Dec 11:12:36.081 * Connecting to MASTER 10.102.1.92:6379
1:S 12 Dec 11:12:36.081 * MASTER <-> SLAVE sync started
1:S 12 Dec 11:12:36.082 * Non blocking connect for SYNC fired the event.
1:S 12 Dec 11:12:36.082 * Master replied to PING, replication can continue...
1:S 12 Dec 11:12:36.083 * Trying a partial resynchronization (request 684581a36d134a6d50f1cea32820004a5ccf3b2d:285273).
1:S 12 Dec 11:12:36.084 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master
1:S 12 Dec 11:12:37.087 * Connecting to MASTER 10.102.1.92:6379
1:S 12 Dec 11:12:37.088 * MASTER <-> SLAVE sync started
...

The Replication documentation states:

Since Redis 4.0, when an instance is promoted to master after a failover, it will be still able to perform a partial resynchronization with the slaves of the old master.

But the log seems to show otherwise. A more detailed log, covering both the first failover from redis-master to redis-slave and the subsequent failover from redis-slave back to redis-master, is available here.

Any idea what's going on? What do I have to do to allow redis-master to return to the master role? The configuration details are provided below:

SERVICES

NAME             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
redis-master     ClusterIP   10.102.1.92     <none>        6379/TCP    11m
redis-slave      ClusterIP   10.107.0.73     <none>        6379/TCP    11m
redis-sentinel   ClusterIP   10.110.128.95   <none>        26379/TCP   11m

redis-master config

requirepass test1234
masterauth test1234
dir /data

tcp-keepalive 60
maxmemory-policy noeviction
appendonly no
bind 0.0.0.0
save 900 1
save 300 10
save 60 10000

slave-announce-ip redis-master.fp8-cache
slave-announce-port 6379

redis-slave config

requirepass test1234
slaveof redis-master.fp8-cache 6379
masterauth test1234
dir /data

tcp-keepalive 60
maxmemory-policy noeviction
appendonly no
bind 0.0.0.0
save 900 1
save 300 10
save 60 10000

slave-announce-ip redis-slave.fp8-cache
slave-announce-port 6379

Solution

  • It turns out that the problem is related to the use of host names instead of IP addresses:

    slaveof redis-master.fp8-cache 6379
    ...
    slave-announce-ip redis-slave.fp8-cache
    

    So, when the master came back as a slave, Sentinel showed 2 slaves: one listed by IP address and another by host name. I am not sure exactly how these 2 slave entries (which point to the same Redis server) cause the problem above. Now that I have changed the config to use IP addresses instead of host names, the Redis HA setup is working flawlessly.
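
    The changed config is not shown above, so the following is only a sketch of what the affected lines might look like. It assumes the ClusterIPs from the SERVICES listing were used (10.102.1.92 for redis-master, 10.107.0.73 for redis-slave); the addresses in your cluster will differ.

    ```
    # redis-master config (sketch)
    # Assumption: 10.102.1.92 is the redis-master ClusterIP from the SERVICES listing
    slave-announce-ip 10.102.1.92
    slave-announce-port 6379

    # redis-slave config (sketch)
    # Assumption: 10.107.0.73 is the redis-slave ClusterIP from the SERVICES listing
    slaveof 10.102.1.92 6379
    slave-announce-ip 10.107.0.73
    slave-announce-port 6379
    ```

    Note that a ClusterIP normally changes if its Service is deleted and recreated, in which case these values would have to be updated.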