When more no of actors in a shard the rebalancing is not happening properly when rolling restart of nodes.
When the nodes restarted one by one sequentially some shard not rebalancing to other node properly.
And other cluster sharding operations not working after the sequential restart.But the nodes still connected to each other in the cluster.
Using Akka-Cluster 2.5.32. Cluster formed with 3 nodes.
Each node will have 2 shards. In each shard there will be 600 entity Actors.
So totally 6 shards --> 3600 Entity Actors in the Cluster.
Enabled
akka.cluster.sharding.remember-entities = on
akka.cluster.sharding.remember-entities-store = ddata
akka.cluster.sharding.distributed-data.durable.keys = []
akka.remote.artery{
enabled = on
transport = tcp
}
When Restarting the node will leave the cluster by invoking cluster.leave(cluster.selfAddress());
14:59:45:024
Node 1 Leaves the Cluster.
14:59:48:800
In Node 1 : Starting HandOffStopper for shard Shard_3 to terminate 600 entities.
14:59:48:800
In Node 1 : Starting HandOffStopper for shard Shard_4 to terminate 600 entities.
15:00:48:805
In Node 3 there is a log Rebalance shard [Shard_4] done [false]
& Rebalance shard [Shard_3] done [false]
15:00:51:935
In Node 2 The Shard_3 is Rebalanced And all 600 entity Actors recreated.
15:00:48:980
In Node 2 The Shard_4 is Rebalanced And all 600 entity Actors recreated.
15:01:23:209
Node 1 Rejoins the Cluster.
15:01:24:052
Node 2 Leaves the Cluster.
15:01:26:804
In Node 2 : Starting HandOffStopper for shard Shard_3 to terminate 600 entities.
15:01:26:804
In Node 2 : Starting HandOffStopper for shard Shard_5 to terminate 600 entities.
15:01:26:804
In Node 2 : Starting HandOffStopper for shard Shard_1 to terminate 600 entities.
15:02:26:794
In Node 3 there is a log Rebalance shard [Shard_3] done [false]
,Rebalance shard [Shard_5] done [false]
& Rebalance shard [Shard_1] done [false]
15:02:27:028
In Node 1 The Shard_5 is Rebalanced And Only 580 entity Actors reacreated.
15:02:28:237
In Node 1 The Shard_1 is Rebalanced And Only 61 entity Actors recreated.
15:02:32:967
In Node 1 The Shard_3 is Rebalanced And Only 35 entity Actors recreated.
15:02:51:133
Node 2 Rejoins the Cluster.
15:02:51:799
Node 3 Leaves the Cluster.
15:02:55:621
In Node 3 : Starting HandOffStopper for shard Shard_0 to terminate 600 entities.
15:02:55:621
In Node 3 : Starting HandOffStopper for shard Shard_2 to terminate 600 entities.
15:02:55:622
In Node 3 : Starting HandOffStopper for shard Shard_4 to terminate 600 entities.
15:03:55:772
In Node 2 The Shard_0 is Rebalanced And Only 116 entity Actors recreated.
And Shard_2 & Shard_4 not rebalance to Node 2.
15:04:27:943
Node 3 Rejoins the Cluster
We have a Singleton actor in the Cluster, which Logs the Current Cluster member Status for a time interval. And it will log the No of Actors in the Cluster by invoking
GetClusterShardingStats getStats = new ShardRegion.GetClusterShardingStats(FiniteDuration.create(5000, TimeUnit.MILLISECONDS));
Future<Object> ack = Patterns.ask(region, getStats, timeout).toCompletableFuture();
ClusterShardingStats ss = (ClusterShardingStats) ack.get(5000, TimeUnit.MILLISECONDS);
With this cluster stats The singlton will Log the Active Actor Count with Shard wise split-up.
Active Actor Count 3600 & Members [ Node1:Up,Node2:Up,Node3:Up ]
Members [ Node1:Up,Node2:Up,Node3:Up ]
And it unable to fetch the Active Actors Count from in
ClusterShardingStats ss = (ClusterShardingStats) ack.get(5000, TimeUnit.MILLISECONDS);
As Patrick pointed out on the Lightbend forum, Akka 2.5 is old and there have been many fixes to "Remember Entities" since then. Especially when dealing with large numbers of entities (3,600 isn't absurd but it's definitely enough that I suspect you could just be pushing 2.5 harder than it can sustain in "remember").
I'd suggest following his advice and moving to a recent version of Akka.
Also, many people misunderstand "Remember Entities". Consider if you really need it. Or if you can minimize the number of entities.
EDIT (to respond to comments):
But we need to know this issue occuring bcause of the oldest version or any configuration or design wise we are missing.
First, I just want to point out that you've asked a lot of questions on StackOverflow about Akka where you are struggling with design issues. I really suspect you are at the limits of where free internet advice can help you, you probably need to hire a consultant or get support from Lightbend.
Before I start responding to the question in the comments. One thing I want to reinforce is that your cluster is rebalancing just fine. The thing that is failing is the remembering. Unlike rebalancing, which is essentially just the singleton deciding which shards to put where, "remembering" involves both tracking and restarting individual actors. It's inherently a thundering herd problem and there's no way around it.
You are telling each node to immediately start 600 simultaneous actor recoveries. With each recover going to potentially be reconstructing its state from hundreds of events it has to replay sequentially. TAnd, if you are restarting nodes in rapid sequence you potentially are hitting the database with this thundering herd from multiple nodes at once. hen following each recovery you are updating the DDData eventually consistent database that with the actor state.
It looks like you are only giving the node 30 seconds to complete all 600 of those recoveries (and potentially tens of thousand sof events processed). Which, certainly isn't impossible, that's not an insane rate, but it certainly looks like you restarting the node before it has recovered all remembered entities completely.