java redis amazon-elasticache lettuce

Redis client Lettuce Master/Slave configuration for AWS ElastiCache


I have been using Lettuce as a Redis client to talk to AWS ElastiCache. The specific configuration that I am currently using is the Static Master/Slave with predefined node addresses. Recently, the primary node took a tumble, kicking off a failover process and eventually causing all application write requests to fail with the following error:

redis.RedisCommandExecutionException: READONLY You can't write against a read only slave.

Since then, I have been doing some research and realized that Standalone Master/Slave is probably the configuration that fits the purpose of talking to ElastiCache (in non-clustered mode): according to the AWS docs, the client should only ever talk to the primary endpoint, which gets updated to point to the new master in the event of a failover.

This has left me wondering: why does the author recommend the Static Master/Slave with predefined node addresses method when using AWS ElastiCache?

Any thoughts?

Configuration: 1 Master and 2 Slave nodes


Solution

  • There are two answers to your question as AWS ElastiCache can be used in different ways:

    1. Using just the master node
    2. Using the master and replicas

    Explanation

    AWS ElastiCache (non-clustered) comes with its very own failover mechanism that does not notify your application when a failover happens. It depends on your use whether this is good or bad:

    Master-only use

    If you want to rely on failover and you don't want to use your replicas for additional reads, then master-only use is the way to go. For master-only use, you point your client to a DNS endpoint that always points to the primary node.

    If ElastiCache happens to fail over, the client connection is reset. Behind the scenes, AWS promotes a replica to master and the primary DNS endpoint is updated to route to it. Once the client successfully reconnects, you're talking to the (new) master node again.

    Why is it not possible to use replicas in this scenario?

    The only topology source is the AWS ElastiCache node itself. Lettuce does not connect to AWS's API (and this won't ever happen). Redis exposes connected replicas in the INFO REPLICATION section, but the ElastiCache Redis node reports replica IP addresses that are not reachable, so it's not possible to connect to these nodes via topology discovery.
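
    To make the discovery problem concrete, here is a minimal sketch of reading roles and replica addresses out of INFO REPLICATION text. The class name `ReplicationInfo` and the sample addresses are illustrative assumptions, not part of Lettuce; the `role:` and `slaveN:ip=...,port=...` lines follow the format Redis itself prints.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative parser for INFO REPLICATION output (hypothetical helper, not a Lettuce class). */
final class ReplicationInfo {
    final String role;                   // "master" or "slave" as reported by the node
    final List<String> replicaAddresses; // "ip:port" pairs as advertised by the master

    private ReplicationInfo(String role, List<String> replicaAddresses) {
        this.role = role;
        this.replicaAddresses = replicaAddresses;
    }

    static ReplicationInfo parse(String info) {
        String role = "unknown";
        List<String> replicas = new ArrayList<>();
        for (String rawLine : info.split("\\r?\\n")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and the "# Replication" header
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String key = line.substring(0, colon);
            String value = line.substring(colon + 1);
            if (key.equals("role")) {
                role = value;
            } else if (key.matches("slave\\d+")) {
                // value looks like: ip=10.0.0.2,port=6379,state=online,offset=1234,lag=0
                Map<String, String> fields = new LinkedHashMap<>();
                for (String pair : value.split(",")) {
                    int eq = pair.indexOf('=');
                    if (eq > 0) {
                        fields.put(pair.substring(0, eq), pair.substring(eq + 1));
                    }
                }
                if (fields.containsKey("ip") && fields.containsKey("port")) {
                    replicas.add(fields.get("ip") + ":" + fields.get("port"));
                }
            }
        }
        return new ReplicationInfo(role, replicas);
    }
}
```

    On ElastiCache the parsing itself succeeds; the trouble is that the addresses it yields are the ones the client cannot reach, so a client has nothing usable to connect to.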

    Using Master and Replicas

    Although it's not possible to deduce the replica endpoints from an ElastiCache server, it's still possible to provide static endpoints. Lettuce connects to all nodes and determines on startup the node roles. This allows again routing according to the node role. If a failover happens (as in your case), Lettuce does not get notified about the failover and sticks to the initial topology.
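
    As a configuration sketch of the static setup described above: it assumes Lettuce 5.2 or later (where the API lives in `io.lettuce.core.masterreplica`; earlier 5.x releases use `io.lettuce.core.masterslave.MasterSlave` instead) and uses placeholder hostnames that you would replace with the individual node endpoints of your replication group.

```java
import io.lettuce.core.ReadFrom;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.StringCodec;
import io.lettuce.core.masterreplica.MasterReplica;
import io.lettuce.core.masterreplica.StatefulRedisMasterReplicaConnection;

import java.util.Arrays;
import java.util.List;

final class StaticTopologyExample {
    public static void main(String[] args) {
        // Placeholder hostnames (assumption): use your own node endpoints here.
        List<RedisURI> nodes = Arrays.asList(
                RedisURI.create("redis://node-1.example.invalid:6379"),
                RedisURI.create("redis://node-2.example.invalid:6379"),
                RedisURI.create("redis://node-3.example.invalid:6379"));

        RedisClient client = RedisClient.create();

        // Lettuce connects to every listed node and determines its role at startup.
        StatefulRedisMasterReplicaConnection<String, String> connection =
                MasterReplica.connect(client, StringCodec.UTF8, nodes);

        // Route reads to replicas when available; writes always go to the node
        // that was the master when the topology was determined.
        connection.setReadFrom(ReadFrom.REPLICA_PREFERRED);

        connection.sync().set("key", "value");
        String value = connection.sync().get("key");
        System.out.println(value);

        connection.close();
        client.shutdown();
    }
}
```

    Because the roles are read once at startup, a failover leaves this connection writing to the demoted node, which is what produces the READONLY error from the question.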

    Failover Notifications

    Failover notifications are the missing bit. While Redis Sentinel provides notifications that indicate a promotion or role change, there's no such mechanism for 'just' Master/Replica. You could say: OK, let's use a disconnect as a signal to trigger a topology update. That might work in some cases, but in many more cases (a network partition between the application and the Redis nodes, connection timeouts) it would trigger updates needlessly. A regular topology update is also just an attempt to discover changes.
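
    A regular update of that kind could look roughly like the sketch below. It is not how Lettuce works internally: `TopologyWatcher` and the `roleLookup` function are hypothetical names, and the lookup stands in for whatever asks a node for its current role (for example via the ROLE command). An unreachable node is deliberately not treated as a role change, since a failed lookup alone does not prove a failover.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper that compares the roles seen at startup with freshly polled roles. */
final class TopologyWatcher {
    private final Map<String, String> knownRoles;      // node address -> last known role
    private final Function<String, String> roleLookup; // assumption: returns "master" or "slave"

    TopologyWatcher(Map<String, String> initialRoles, Function<String, String> roleLookup) {
        this.knownRoles = new LinkedHashMap<>(initialRoles);
        this.roleLookup = roleLookup;
    }

    /** Polls every node once; returns true if any node reports a different role than before. */
    boolean refresh() {
        boolean changed = false;
        for (Map.Entry<String, String> entry : knownRoles.entrySet()) {
            String current;
            try {
                current = roleLookup.apply(entry.getKey());
            } catch (RuntimeException e) {
                continue; // node unreachable: keep the last known role
            }
            if (current != null && !current.equals(entry.getValue())) {
                entry.setValue(current);
                changed = true;
            }
        }
        return changed;
    }

    /** Returns the first node currently believed to be the master, or null if none is known. */
    String currentMaster() {
        for (Map.Entry<String, String> entry : knownRoles.entrySet()) {
            if ("master".equals(entry.getValue())) {
                return entry.getKey();
            }
        }
        return null;
    }
}
```

    When `refresh()` returns true, the application would close and re-create its connection so that routing follows the new roles. Polling only narrows the window in which writes hit a demoted node; it does not remove it.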

    The Third answer

    I'm not happy with the AWS ElastiCache implementation. It works OK for Master-only use, but as soon as you want to use replicas, you're relying on a proprietary implementation of failover. Without AWS failover (i.e. in your own data center/Redis setup), you would be notified by some Ops people that Redis is down. They would either restart the Redis node or restart the application to restore operations. These signals are missing.

    In the meantime, AWS provides Redis Cluster, which might be the better HA/failover setup, but Redis Cluster comes with severe limitations for applications. It might also be possible to poll the AWS ElastiCache API to discover the topology from the API side of things and then kick off a topology update (reconnect).

    Lettuce's Master/Replica API for static topologies exists to provide at least some way to work with replicas. Everything else derives from this experience. Contributions in any form (experience, suggestions, documentation, code) are welcome.

    Update: Aligned replica wording according to antirez/redis#5335