akka actor akka-stream akka-cluster akka-remoting

Akka - Actor Cluster Inconsistent Shard rebalance while Rolling restart of nodes

Issue

When more no of actors in a shard the rebalancing is not happening properly when rolling restart of nodes.
When the nodes restarted one by one sequentially some shard not rebalancing to other node properly.
And other cluster sharding operations not working after the sequential restart.But the nodes still connected to each other in the cluster.

Cluster Setup

Using Akka-Cluster 2.5.32. Cluster formed with 3 nodes.
Each node will have 2 shards. In each shard there will be 600 entity Actors.
So totally 6 shards --> 3600 Entity Actors in the Cluster.
Enabled

akka.cluster.sharding.remember-entities = on
akka.cluster.sharding.remember-entities-store = ddata
akka.cluster.sharding.distributed-data.durable.keys = []

akka.remote.artery{
        enabled = on
        transport = tcp
}

At Start

Node 1 Shard 4,3 with [ 600,600 ] entities
Node 2 Shard 1,5 with [ 600,600 ] entities
Node 3 Shard 0,2 with [ 600,600 ] entities

Restart & Rebalance Flow

When Restarting the node will leave the cluster by invoking cluster.leave(cluster.selfAddress());

i)Node 1 Restarts

14:59:45:024 Node 1 Leaves the Cluster.
14:59:48:800 In Node 1 : Starting HandOffStopper for shard Shard_3 to terminate 600 entities.
14:59:48:800 In Node 1 : Starting HandOffStopper for shard Shard_4 to terminate 600 entities.
15:00:48:805 In Node 3 there is a log Rebalance shard [Shard_4] done [false] & Rebalance shard [Shard_3] done [false]
15:00:51:935 In Node 2 The Shard_3 is Rebalanced And all 600 entity Actors recreated.
15:00:48:980 In Node 2 The Shard_4 is Rebalanced And all 600 entity Actors recreated.

After Node 1 Restarts Now

Node 1 Down
Node 2 Shard 1,5,3 with [ 600,600,600 ] entities
Node 3 Shard 0,2,4 with [ 600,600,600 ] entities

ii)Node 2 Restarts

15:01:23:209 Node 1 Rejoins the Cluster.
15:01:24:052 Node 2 Leaves the Cluster.
15:01:26:804 In Node 2 : Starting HandOffStopper for shard Shard_3 to terminate 600 entities.
15:01:26:804 In Node 2 : Starting HandOffStopper for shard Shard_5 to terminate 600 entities.
15:01:26:804 In Node 2 : Starting HandOffStopper for shard Shard_1 to terminate 600 entities.
15:02:26:794 In Node 3 there is a log Rebalance shard [Shard_3] done [false],Rebalance shard [Shard_5] done [false] & Rebalance shard [Shard_1] done [false]
15:02:27:028 In Node 1 The Shard_5 is Rebalanced And Only 580 entity Actors reacreated.
15:02:28:237 In Node 1 The Shard_1 is Rebalanced And Only 61 entity Actors recreated.

15:02:32:967 In Node 1 The Shard_3 is Rebalanced And Only 35 entity Actors recreated.

After Node 2 Restarts Now

Node 1 Shard 1,5,3 with [ 35,580,61 ] entities
Node 2 Down
Node 3 Shard 0,2,4 with [ 600,600,600 ] entities

iii)Node 3 Restarts

15:02:51:133 Node 2 Rejoins the Cluster.
15:02:51:799 Node 3 Leaves the Cluster.
15:02:55:621 In Node 3 : Starting HandOffStopper for shard Shard_0 to terminate 600 entities.
15:02:55:621 In Node 3 : Starting HandOffStopper for shard Shard_2 to terminate 600 entities.
15:02:55:622 In Node 3 : Starting HandOffStopper for shard Shard_4 to terminate 600 entities.
15:03:55:772 In Node 2 The Shard_0 is Rebalanced And Only 116 entity Actors recreated.
And Shard_2 & Shard_4 not rebalance to Node 2.
15:04:27:943 Node 3 Rejoins the Cluster

After Node 3 Restarts Now

Node 1 Shard 1,5,3 with [ 35,580,61 ] entities
Node 2 Shard 0 with [ 116 ] entities
Node 3 No Shards.

We have a Singleton actor in the Cluster, which Logs the Current Cluster member Status for a time interval. And it will log the No of Actors in the Cluster by invoking

GetClusterShardingStats getStats = new ShardRegion.GetClusterShardingStats(FiniteDuration.create(5000, TimeUnit.MILLISECONDS)); 
 
Future<Object> ack = Patterns.ask(region, getStats, timeout).toCompletableFuture(); 
 
ClusterShardingStats ss = (ClusterShardingStats) ack.get(5000, TimeUnit.MILLISECONDS);

With this cluster stats The singlton will Log the Active Actor Count with Shard wise split-up.

Before this sequential Restart it logs

Active Actor Count 3600  &  Members [ Node1:Up,Node2:Up,Node3:Up ]

After this sequential Restart it logs

Members [ Node1:Up,Node2:Up,Node3:Up ]

And it unable to fetch the Active Actors Count from in

ClusterShardingStats ss = (ClusterShardingStats) ack.get(5000, TimeUnit.MILLISECONDS);

Queries

In my use case the nodes will be restarted frequently like this.
Why the Cluster Sharding rebalance is not properly done here and any clue on what may be the problem here?
Is there any more information needed to further debug this issue?

Solution

As Patrick pointed out on the Lightbend forum, Akka 2.5 is old and there have been many fixes to "Remember Entities" since then. Especially when dealing with large numbers of entities (3,600 isn't absurd but it's definitely enough that I suspect you could just be pushing 2.5 harder than it can sustain in "remember").

I'd suggest following his advice and moving to a recent version of Akka.

Also, many people misunderstand "Remember Entities". Consider if you really need it. Or if you can minimize the number of entities.

EDIT (to respond to comments):

But we need to know this issue occuring bcause of the oldest version or any configuration or design wise we are missing.

First, I just want to point out that you've asked a lot of questions on StackOverflow about Akka where you are struggling with design issues. I really suspect you are at the limits of where free internet advice can help you, you probably need to hire a consultant or get support from Lightbend.

Before I start responding to the question in the comments. One thing I want to reinforce is that your cluster is rebalancing just fine. The thing that is failing is the remembering. Unlike rebalancing, which is essentially just the singleton deciding which shards to put where, "remembering" involves both tracking and restarting individual actors. It's inherently a thundering herd problem and there's no way around it.

You are telling each node to immediately start 600 simultaneous actor recoveries. With each recover going to potentially be reconstructing its state from hundreds of events it has to replay sequentially. TAnd, if you are restarting nodes in rapid sequence you potentially are hitting the database with this thundering herd from multiple nodes at once. hen following each recovery you are updating the DDData eventually consistent database that with the actor state.

It looks like you are only giving the node 30 seconds to complete all 600 of those recoveries (and potentially tens of thousand sof events processed). Which, certainly isn't impossible, that's not an insane rate, but it certainly looks like you restarting the node before it has recovered all remembered entities completely.

Is it a "design issue"? Probably yes. Remember Entities is rarely the right answer. That there 1800 remembered entities feels even worse. 1800 remembered entities in a system where nodes will be "restored frequently" sets off red flags. But I'm just giving free advice on the internet, and haven't seen your business problem or the whole of your design. Maybe I could be persuaded.
Is it a "old version" issue? Probably yes. There have been a lot of improvements to both remember entities and distributed data, especially under stress. And because of the large number of remembered entities and frequent restarts you are putting stress on both systems. Could a newer version of Akka recover faster? Probably.
Is it a "config issue" with remember entities? Probably? Distributed Data is fundamentally an eventually consistent database. Restarting nodes frequently is going to take careful consideration in those circumstances. I'm not sure if you can tweak some of the DData settings for remember entities. Switching to eventsourcing for remember entities might help since event sourcing is strongly consistent. This is definitely one of the areas where I think you may need more help than you can get in free internet advice.
Is it a "config issue" with persistence? Probably? Since you are putting so much stress on the system you will really want to tune those entities to recover quickly. e.g. frequent snapshots
Is it a "code issue"? Maybe? Since you are putting so much stress the system, you'll definitely need to make you are efficient.