I am using Spring Session to externalize our session to Redis (AWS ElastiCache). Lettuce is being used as our client to Redis.
My AWS Redis configuration is the following:
My Lettuce configuration is the following:
<!-- Lettuce Configuration -->
<bean class="org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory">
    <constructor-arg ref="redisClusterConfiguration"/>
</bean>

<!-- Redis Cluster Configuration -->
<bean id="redisClusterConfiguration" class="org.springframework.data.redis.connection.RedisClusterConfiguration">
    <constructor-arg>
        <list>
            <value><!-- AMAZON SINGLE ENDPOINT HERE --></value>
        </list>
    </constructor-arg>
</bean>
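For what it's worth, my understanding of the Java-config equivalent of the XML above is roughly the sketch below (my paraphrase, not code we actually run; the redis.endpoint system property stands in for the single ElastiCache endpoint). Note that no ClientOptions are supplied anywhere, so Lettuce is left on its default ClusterClientOptions:

import java.util.Collections;

import org.springframework.data.redis.connection.RedisClusterConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;

public class XmlEquivalentSketch {

    // Rough Java equivalent of the XML configuration above (paraphrase only).
    public LettuceConnectionFactory lettuceConnectionFactory() {
        RedisClusterConfiguration clusterConfiguration = new RedisClusterConfiguration(
                Collections.singletonList(System.getProperty("redis.endpoint")));
        // No LettuceClientConfiguration / ClientOptions passed, so Lettuce defaults apply.
        return new LettuceConnectionFactory(clusterConfiguration);
    }
}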
The issue appears when we trigger a failover of a master node. During a test failover I get the following events logged:
Source             Type                Date                                        Event
myserver-0001-001  cache-cluster       Monday, July 6, 2020 at 8:25:32 PM UTC+3    Finished recovery for cache nodes 0001
myserver-0001-001  cache-cluster       Monday, July 6, 2020 at 8:20:38 PM UTC+3    Recovering cache nodes 0001
myserver           replication-group   Monday, July 6, 2020 at 8:19:14 PM UTC+3    Failover to replica node myserver-0001-002 completed
myserver           replication-group   Monday, July 6, 2020 at 8:17:59 PM UTC+3    Test Failover API called for node group 0001
AWS customer support claims that, as long as the Redis client is Redis Cluster aware, it should be able to connect to the newly promoted master as soon as the "Failover to replica node myserver-0001-002 completed" event is fired (i.e. 1m and 15s after triggering the failover). Our client, however, seems to reconnect only after the "Finished recovery for cache nodes 0001" event is fired (i.e. 7m and 32s later). Meanwhile we get errors like the following:
org.springframework.data.redis.RedisSystemException: Error in execution; nested exception is io.lettuce.core.RedisCommandExecutionException: CLUSTERDOWN The cluster is down
While the failover is taking place, the following can be seen via redis-cli:
endpoint:6379> cluster nodes
ffe51ecc6a8c1f32ab3774eb8f159bd64392dc14 172.31.11.216:6379@1122 master - 0 1594114396000 9 connected 0-8191
f8ff7a20f4c493b63ba65f107f575631faa4eb1b 172.31.11.52:6379@1122 slave 4ab10ca0a6a932179432769fcba7fab0faba01f7 0 1594114396872 2 connected
c18fee0e47800d792676c7d14782d81d7d1684e8 172.31.10.64:6379@1122 master,fail - 1594114079948 1594114077000 8 connected
4ab10ca0a6a932179432769fcba7fab0faba01f7 172.31.10.84:6379@1122 myself,master - 0 1594114395000 2 connected 8192-16383
endpoint:6379> cluster nodes
ffe51ecc6a8c1f32ab3774eb8f159bd64392dc14 172.31.11.216:6379@1122 master - 0 1594114461262 9 connected 0-8191
f8ff7a20f4c493b63ba65f107f575631faa4eb1b 172.31.11.52:6379@1122 slave 4ab10ca0a6a932179432769fcba7fab0faba01f7 0 1594114460000 2 connected
6a7339ae4df3c78e31c9cc8fd8cec4803eed5fc1 172.31.10.64:6379@1122 master - 0 1594114460256 0 connected
c18fee0e47800d792676c7d14782d81d7d1684e8 172.31.10.64:6379@1122 master,fail - 1594114079948 1594114077000 8 connected
4ab10ca0a6a932179432769fcba7fab0faba01f7 172.31.10.84:6379@1122 myself,master - 0 1594114458000 2 connected 8192-16383
endpoint:6379> cluster nodes
ffe51ecc6a8c1f32ab3774eb8f159bd64392dc14 172.31.11.216:6379@1122 master - 0 1594114509000 9 connected 0-8191
f8ff7a20f4c493b63ba65f107f575631faa4eb1b 172.31.11.52:6379@1122 slave 4ab10ca0a6a932179432769fcba7fab0faba01f7 0 1594114510552 2 connected
6a7339ae4df3c78e31c9cc8fd8cec4803eed5fc1 172.31.10.64:6379@1122 slave ffe51ecc6a8c1f32ab3774eb8f159bd64392dc14 0 1594114510000 9 connected
c18fee0e47800d792676c7d14782d81d7d1684e8 172.31.10.64:6379@1122 master,fail - 1594114079948 1594114077000 8 connected
4ab10ca0a6a932179432769fcba7fab0faba01f7 172.31.10.84:6379@1122 myself,master - 0 1594114508000 2 connected 8192-16383
endpoint:6379> cluster nodes
ffe51ecc6a8c1f32ab3774eb8f159bd64392dc14 172.31.11.216:6379@1122 master - 0 1594114548000 9 connected 0-8191
f8ff7a20f4c493b63ba65f107f575631faa4eb1b 172.31.11.52:6379@1122 slave 4ab10ca0a6a932179432769fcba7fab0faba01f7 0 1594114548783 2 connected
6a7339ae4df3c78e31c9cc8fd8cec4803eed5fc1 172.31.10.64:6379@1122 slave ffe51ecc6a8c1f32ab3774eb8f159bd64392dc14 0 1594114547776 9 connected
4ab10ca0a6a932179432769fcba7fab0faba01f7 172.31.10.84:6379@1122 myself,master - 0 1594114547000 2 connected 8192-16383
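On the client side, the equivalent view can be dumped from Lettuce itself, which is handy for seeing when the client actually notices the promotion. A minimal sketch using the plain Lettuce API directly (RedisClusterClient, getPartitions() and refreshPartitions() are standard Lettuce calls; connecting through the single endpoint via the redis.endpoint property is my assumption):

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.RedisClusterClient;

public class TopologyDump {

    public static void main(String[] args) {
        // Connect through the same single ElastiCache endpoint used above.
        RedisClusterClient client = RedisClusterClient.create(
                RedisURI.create(System.getProperty("redis.endpoint"), 6379));

        // Partitions are the client-side, cached view of "cluster nodes".
        client.getPartitions().forEach(node ->
                System.out.println(node.getNodeId() + " " + node.getUri() + " " + node.getFlags()));

        // Force a topology reload and print the view again.
        client.refreshPartitions();
        client.getPartitions().forEach(node ->
                System.out.println(node.getNodeId() + " " + node.getUri() + " " + node.getFlags()));

        client.shutdown();
    }
}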
As far as I understand, the Lettuce client used by Spring Session is Redis Cluster aware, hence the RedisClusterConfiguration class in the XML configuration. Checking the documentation, some similar questions here on SO, and Lettuce's GitHub issues page didn't make it clear to me how Lettuce works in Redis Cluster mode, specifically with AWS hiding the individual node IPs behind a single endpoint.
Shouldn't my configuration be enough for Lettuce to connect to the newly promoted master? Do I need to enable a different mode in Lettuce (e.g. topology refresh) for it to be able to receive the notification from Redis and switch to the new master?
Also, how does Lettuce handle the single endpoint from AWS? Does it resolve the IPs and then use them? Are they cached?
If I want to enable reads from all four nodes, is my configuration enough? In a Redis Cluster (i.e. even outside of AWS's context), when a slave is promoted to master, does the client poll to get this information or does the cluster somehow push it to the client?
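To make the read question concrete, my understanding from the Lettuce/Spring Data Redis docs is that replica reads have to be opted into via ReadFrom on the client configuration, roughly like the untested sketch below (ReadFrom.REPLICA_PREFERRED in recent Lettuce versions; older versions call it SLAVE_PREFERRED):

import org.springframework.data.redis.connection.RedisClusterConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceClientConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;

import io.lettuce.core.ReadFrom;

public class ReplicaReadsSketch {

    // Untested sketch: route reads to replicas when possible, fall back to the master.
    public LettuceConnectionFactory lettuceConnectionFactory(RedisClusterConfiguration clusterConfiguration) {
        LettuceClientConfiguration clientConfiguration = LettuceClientConfiguration.builder()
                .readFrom(ReadFrom.REPLICA_PREFERRED)
                .build();
        return new LettuceConnectionFactory(clusterConfiguration, clientConfiguration);
    }
}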
Any resources (even Lettuce source files) that could clarify the above, as well as the different modes in the context of Lettuce, Redis, and AWS, would be more than welcome.
As you can see, I am still a bit confused about this.
Thanks a lot in advance for any help and information provided.
Debugging was enabled and breakpoints were used to intercept bean creation and switch topology refreshing on that way, i.e. by enabling the ClusterTopologyRefreshTask through the constructor of ClusterClientOptions:
protected ClusterClientOptions(Builder builder) {
    super(builder);
    this.validateClusterNodeMembership = builder.validateClusterNodeMembership;
    this.maxRedirects = builder.maxRedirects;
    ClusterTopologyRefreshOptions refreshOptions = builder.topologyRefreshOptions;
    if (refreshOptions == null) {
        refreshOptions = ClusterTopologyRefreshOptions.builder() //
                .enablePeriodicRefresh(DEFAULT_REFRESH_CLUSTER_VIEW) // Breakpoint here and enter to enable refreshing
                .refreshPeriod(DEFAULT_REFRESH_PERIOD_DURATION) // Breakpoint here and enter to set the refresh interval
                .closeStaleConnections(builder.closeStaleConnections) //
                .build();
    }
    this.topologyRefreshOptions = refreshOptions;
}
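For reference, when Lettuce is used directly rather than through Spring Session, the same thing the breakpoints achieve can be done by passing ClusterClientOptions to the client. A minimal sketch mirroring the two breakpoints above (the 30-second period is an arbitrary value I picked, and the endpoint handling is simplified):

import java.time.Duration;

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;

public class PlainClientTopologyRefresh {

    public static void main(String[] args) {
        RedisClusterClient client = RedisClusterClient.create(
                RedisURI.create(System.getProperty("redis.endpoint"), 6379));

        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(true)            // what the first breakpoint flips
                .refreshPeriod(Duration.ofSeconds(30))  // what the second breakpoint sets
                .build();

        client.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(refreshOptions)
                .build());

        // ... obtain connections via client.connect() as usual ...
        client.shutdown();
    }
}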
It seems to be refreshing OK, but the question now is: how do I configure this when Lettuce is used through Spring Session rather than as a plain Redis client?
As I was going through my questions I realized I hadn't answered this one! So here it is, in case someone runs into the same issue.
What I ended up doing is creating a Redis configuration class instead of using XML. The code is as follows:
import java.time.Duration;
import java.util.Arrays;
import java.util.List;
import org.springframework.context.annotation.Bean;
import org.springframework.data.redis.connection.RedisClusterConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceClientConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.data.redis.connection.lettuce.LettucePoolingClientConfiguration;
import org.springframework.session.data.redis.config.ConfigureRedisAction;
import org.springframework.session.data.redis.config.annotation.web.http.EnableRedisHttpSession;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;

@EnableRedisHttpSession
public class RedisConfig {

    private static final List<String> clusterNodes = Arrays.asList(System.getProperty("redis.endpoint"));

    // ElastiCache disallows CONFIG commands, so keep Spring Session from configuring keyspace notifications.
    @Bean
    public static ConfigureRedisAction configureRedisAction() {
        return ConfigureRedisAction.NO_OP;
    }

    @Bean(destroyMethod = "destroy")
    public LettuceConnectionFactory lettuceConnectionFactory() {
        RedisClusterConfiguration redisClusterConfiguration = new RedisClusterConfiguration(clusterNodes);
        return new LettuceConnectionFactory(redisClusterConfiguration, getLettuceClientConfiguration());
    }

    // Enable periodic cluster topology refresh (every 30 seconds) on the Lettuce client.
    private LettuceClientConfiguration getLettuceClientConfiguration() {
        ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder().enablePeriodicRefresh(Duration.ofSeconds(30)).build();
        ClusterClientOptions clusterClientOptions = ClusterClientOptions.builder().topologyRefreshOptions(topologyRefreshOptions).build();
        return LettucePoolingClientConfiguration.builder().clientOptions(clusterClientOptions).build();
    }
}
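A possible refinement I have not battle-tested: on top of the 30-second periodic poll, Lettuce also offers adaptive refresh triggers, which reload the topology as soon as the client observes MOVED/ASK redirects or persistent reconnect attempts, so the switch to the new master should happen sooner than the next poll. This would be a drop-in replacement for the getLettuceClientConfiguration() method above (same imports; the timeout value is arbitrary):

// Variation of getLettuceClientConfiguration() with adaptive refresh triggers enabled.
private LettuceClientConfiguration getLettuceClientConfiguration() {
    ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
            .enablePeriodicRefresh(Duration.ofSeconds(30))
            // Also refresh when Lettuce observes MOVED/ASK redirects or persistent reconnects.
            .enableAllAdaptiveRefreshTriggers()
            // Rate-limit adaptive refreshes to at most one per 30 seconds.
            .adaptiveRefreshTriggersTimeout(Duration.ofSeconds(30))
            .build();
    ClusterClientOptions clusterClientOptions = ClusterClientOptions.builder()
            .topologyRefreshOptions(topologyRefreshOptions)
            .build();
    return LettucePoolingClientConfiguration.builder().clientOptions(clusterClientOptions).build();
}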
Then, instead of registering my ContextLoaderListener through XML, I used an initializer, like so:
public class Initializer extends AbstractHttpSessionApplicationInitializer {

    public Initializer() {
        super(RedisConfig.class);
    }
}
This seems to set up refreshing OK, but I don't know if it is the proper way to do it! If anyone knows of a better solution, please feel free to comment here.