akkaactorakka-clusterakka-remoting

Actor Cluster Sharding Remember entities not properly recreating while Rolling restart of nodes


Problem

In a 3 node Actor cluster, While doing rolling restart of 3 Nodes Remembered entities are not recreating properly.
The Shards are completely rebalanced but Some of the entities not recreated.

Cluster Configurations

akka.cluster.sharding.remember-entities = on
akka.cluster.sharding.remember-entities-store = ddata
akka.cluster.sharding.distributed-data.durable.keys = []

akka.remote.artery{
        enabled = on
        transport = tcp
}

At start all the 3 nodes will have 100 shards in each node with 1000 Actors totally 300 Shards And 3000 Actors.

1.When Node 1 Down Shards on node 1 rebalanced to Node 2 And node 3 with all the remembered entities recreated on those nodes.

2.When Node 1 is Up after few moments Node 2 getting Down .Shards and the Remembered entities on Node 2 is recreated to Node 1.

3.When Node 2 is Up after few moments Node 3 getting down.Shards and the Remembered entities on Node 2 is recreated to Node 2 but some of the entities not recreated to Node 2 from Node 3. All the Shards are rebalanced anyway.

The issue here is

When we restart the Node 3 after the Node 2 joined the Cluster the recreation of Remembered entities is inconsistent.
In mean time there are messages will be send to the Actors on the Cluster.

What can be the bottleneck here when the Node 3 Restarted right after the Node 2 joins?

Tried

1.If we are not restarting the Node 3 there is no issue with the Entities.
2.If we restart the Node 3 alone in rolling restart after some time there is no problem.
3.Increased\decreased Shard count.
4.Changed akka.cluster.distributed-data.majority-min-cap from default 5 to 3 still issue persists.

Is there any configurations need to be tuned?

In which part we need to debug further to find the root cause?


Solution

  • Answer for the Own Question.

    While Debugging Further We found that the issue is with respect to replicating the Remember entities across the nodes in between the frequent restarts.

    Debugging

    enter image description here

    To retrieve the above key-value pairs, we can send a Get message to the Replicator Actor. This message will allow us to fetch the desired key values stored in the Actor Cluster.

    Key<ORSet<String>> rememberEntitiesKey = ORSetKey.create(orSetKey);
    Get<ORSet<String>> getCmd = new Get<ORSet<String>>(rememberEntitiesKey,Replicator.readLocal());
    
    Future<Object> ack = Patterns.ask(replicator, getCmd, timeout1).toCompletableFuture();
    Object result =ack.get(5000, TimeUnit.MILLISECONDS);
    
    Replicator.GetSuccess<ORSet<String>> Orset = (GetSuccess<ORSet<String>>) result;
    The value Contains entities in the Local ddata ---> Orset.dataValue().getElements();
    

    To find out the current state of Remember entities on each node, we can check the local data of that node. By looking at the local data, we can see how the Remember entities are currently stored

    For the case mentioned in the Question the local replicator Rememeber entities data.

    Issue :: UnderReplication in-between the frequent Restart

    Due to frequent restarts, the Replicator may fail to fully replicate the data, resulting in incomplete or inconsistent data in the cluster By tuning the below properties we have resolved this issue

    enter image description here

    Configured to this value

    akka.cluster.sharding.distributed-data {
               gossip-interval = 500 ms        // default 2 s
                notify-subscribers-interval = 100 ms   // default 500 ms
            }
    

    After implementing these changes, even during frequent rolling restarts, the Remembered Entities data is now fully replicated to all nodes and all the entities are recreated successfully.