akka, akka-cluster, akka-remote-actor, akka-remoting

Actor Cluster with WeaklyUp Members making the cluster too slow to respond


Need some clarification about auto-downing & WeaklyUp members.

We send data from the proxy to a cluster node. Before sending to the entity actor, we check whether that actor is alive by invoking an actor selection.

Initially we used auto-down-when-unreachable to down a node that becomes unreachable, but it sometimes destroyed the cluster, so we turned it off.

When all the nodes are Up, actor selection is very fast (<10 ms) and we can send data from the proxy to the cluster quickly.
If any node is restarted, it rejoins as WeaklyUp because a new port is allocated.
If any WeaklyUp member is present in the cluster, actor selection takes more than 20 seconds, so sending data to the cluster is too slow.

What is the behaviour here?
How can we avoid this?
Why is the WeaklyUp member making the cluster slow?


Solution

  • We send data from the proxy to a cluster node.

    I don't really understand what you mean by this. I also don't understand what you mean by "cluster node" and "proxy node". I think you mean that you are using node roles to have only two nodes participate in entity sharding (with the other three only hosting proxies). This doesn't seem like a good design to me given how small your cluster is, but I don't think it's directly related to your problem.
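    If that reading is right, the role split would look roughly like the following config sketch. This is an assumption about your setup, not something from your question; the role name "sharding" is illustrative:

    ```hocon
    # On the two nodes that host entities (assumed role name):
    akka.cluster.roles = ["sharding"]
    # Restrict shard allocation to nodes carrying that role:
    akka.cluster.sharding.role = "sharding"
    # The proxy-only nodes would omit the "sharding" role and start
    # a shard-region proxy instead of hosting entities.
    ```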

    Before sending to the entity actor, we check whether that actor is alive by invoking an actor selection.

    Are you querying each possible node individually? (Since actor selection would include the node.) This seems like a very bad idea for multiple reasons.

    Initially we used auto-down-when-unreachable to down a node that becomes unreachable, but it sometimes destroyed the cluster, so we turned it off.

    Based on other comments from other questions, I believe you mean down-all-when-unstable. This is a sign that your network stability is very, very, very bad. This setting is a failsafe in the clustering that basically says "if the network is so unreliable that there is no safe way to continue, shut down the cluster".

    I would never recommend running a cluster with this setting off, as:

    A) It's there to ensure safety. If you turn it off you are inviting inconsistency.

    B) If it is triggering, there are massive problems in your networking that need to be resolved.
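    For reference, this is where that fail-safe lives in the Split Brain Resolver configuration (Akka 2.6+; the values shown are the defaults as I recall them, so check your version's reference.conf):

    ```hocon
    akka.cluster {
      # Enable the built-in Split Brain Resolver:
      downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
      split-brain-resolver {
        active-strategy = keep-majority
        # Fail-safe: if reachability observations keep flapping for too long,
        # down all nodes rather than risk an inconsistent cluster.
        # Setting this to "off" disables the safety valve.
        down-all-when-unstable = on
      }
    }
    ```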

    If any node restarted it was joining as WeaklyUp because new port is allocated.

    I'm not sure what you mean by "because new port is allocated". By definition, when a node rejoins it already has a new port, because it has to communicate that port as part of joining.

    But regardless, WeaklyUp isn't caused by anything like that. It's caused when the leader has recognized a new node but hasn't yet established consensus. With a huge cluster this can be somewhat normal while the new node's information is propagated. But with a tiny cluster like yours, if this state persists for more than a few milliseconds, it's another sign that you are having massive network problems that are preventing the nodes from consistently sharing gossip.
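    Incidentally, the WeaklyUp promotion itself is controlled by a single flag (enabled by default in recent Akka versions). Turning it off makes joining nodes stay in Joining until convergence instead of being promoted to WeaklyUp, but that only masks the symptom rather than fixing the gossip instability:

    ```hocon
    # Default is "on". With "off", members are not promoted to WeaklyUp
    # while the cluster lacks convergence; they remain Joining.
    akka.cluster.allow-weakly-up-members = on
    ```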

    If I were trying to troubleshoot your system, I would want all of the cluster logs. But from the information given, all signs point towards a problem with the networking at the OS level, with Akka just struggling to remain consistent in the face of that underlying network instability.