Tags: java, akka, marathon

Automatically down nodes in Akka cluster with marathon-api after deployment


I have an application that deploys an Akka cluster using marathon-api with ClusterBootstrap.
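
For context, the relevant bootstrap configuration looks roughly like this (a sketch based on the akka-discovery-marathon-api module; the URL, port name, and label query below are the documented examples, not my exact setup):

akka.management.cluster.bootstrap {
  contact-point-discovery {
    # Discover contact points through the Marathon API
    discovery-method = marathon-api
  }
}

akka.discovery.marathon-api {
  # Marathon endpoint used to list the app's running tasks
  app-api-url = "http://marathon.mesos:8080/v2/apps"
  # Named port of the Akka Management endpoint in the Marathon app definition
  app-port-name = "akkamgmthttp"
  # Label query used to select which Marathon app forms the cluster
  app-label-query = "ACTOR_SYSTEM_NAME==%s"
}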

We have a cluster of 4 nodes. When a deployment runs, Marathon kills the old instances and starts new ones.

After a deployment the cluster looks like this (assuming 2 instances in this example):

{
  "leader": "akka.tcp://app@ip-10-0-5-15.eu-central-1.compute.internal:13655",
  "members": [
    {
      "node": "akka.tcp://app@ip-10-0-4-8.eu-central-1.compute.internal:15724",
      "nodeUid": "-1598489963",
      "roles": [
        "dc-default"
      ],
      "status": "Up"
    },
    {
      "node": "akka.tcp://app@ip-10-0-5-15.eu-central-1.compute.internal:13655",
      "nodeUid": "-1604243482",
      "roles": [
        "dc-default"
      ],
      "status": "Up"
    }
  ],
  "oldest": "akka.tcp://app@ip-10-0-4-8.eu-central-1.compute.internal:15724",
  "oldestPerRole": {
    "dc-default": "akka.tcp://app@ip-10-0-4-8.eu-central-1.compute.internal:15724"
  },
  "selfNode": "akka.tcp://app@ip-10-0-5-15.eu-central-1.compute.internal:13655",
  "unreachable": [
    {
      "node": "akka.tcp://app@ip-10-0-4-8.eu-central-1.compute.internal:15724",
      "observedBy": [
        "akka.tcp://app@ip-10-0-5-15.eu-central-1.compute.internal:13655"
      ]
    }
  ]
}

Sometimes the leader remains WeaklyUp, but the idea is the same; the killed nodes can show as either Up or Leaving.

Then the logs start showing this message:

Cluster Node [akka.tcp://app@ip-10-0-5-15.eu-central-1.compute.internal:13655] - Leader can currently not perform its duties, reachability status: [akka.tcp://app@ip-10-0-5-15.eu-central-1.compute.internal:13655 -> akka.tcp://app@ip-10-0-4-8.eu-central-1.compute.internal:15724: Unreachable [Unreachable] (1)], member status: [
akka.tcp://app@ip-10-0-4-8.eu-central-1.compute.internal:15724 Up seen=false, 
akka.tcp://app@ip-10-0-5-15.eu-central-1.compute.internal:13655 Up seen=true]

And deploying more times makes this even worse.

I imagine that when a node is killed this alters the state of the cluster, because the node is then in fact not reachable, but I was hoping there would be some kind of feature that solves this issue.

Until now the only thing that works to solve this is to use Akka Cluster HTTP Management, doing a PUT /cluster/members/{address} request with operation: Down.
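
For reference, this is roughly how that call looks (assuming Akka Management's HTTP endpoint is listening on its default port 8558; the node address is the unreachable one from the example above):

curl -X PUT \
  -F operation=down \
  http://ip-10-0-5-15.eu-central-1.compute.internal:8558/cluster/members/akka.tcp://app@ip-10-0-4-8.eu-central-1.compute.internal:15724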

I know there was a feature called auto-downing, which was removed because it was doing more harm than good.

I also tried the Split Brain Resolver with the strategies provided there, but in the end those just down the complete cluster, with a log like this:

> Cluster Node [akka://app@ip-10-0-5-215.eu-central-1.compute.internal:43211] - Leader can currently not perform its duties, reachability status: [akka://app@ip-10-0-5-215.eu-central-1.compute.internal:43211 -> akka://app@ip-10-0-4-146.eu-central-1.compute.internal:2174: Unreachable [Unreachable] (1)], member status: [akka://app@ip-10-0-4-146.eu-central-1.compute.internal:2174 Up seen=false, akka://app@ip-10-0-5-215.eu-central-1.compute.internal:43211 Up seen=true]
> Running CoordinatedShutdown with reason [ClusterDowningReason]
> SBR is downing
> SBR took decision DownReachable and is downing

Maybe I have not set up the right strategy with the right configuration, but I am not sure what to try. Again, I have a 4-node cluster, so I would guess the default Keep Majority strategy should do it, although this case is more a crashed node than a network partition.
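
For reference, this is the kind of configuration I tried (a sketch for Akka 2.6, where SBR ships as part of akka-cluster; the stable-after value is just the documented default, not something I tuned):

akka.cluster {
  # Use the Split Brain Resolver shipped with Akka 2.6 as the downing provider
  downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
  split-brain-resolver {
    # Keep the side with the majority of known members, down the rest
    active-strategy = keep-majority
    # Membership must be stable this long before a downing decision is taken
    stable-after = 20s
  }
}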

Is there a way to have a smooth deployment of an Akka Cluster using marathon-api? I am open to suggestions.

Update: I was also upgrading the Akka version from 2.5.x to 2.6.x, which the documentation states is not compatible, so I needed to intervene manually in the first deployment. In the end, using the Split Brain Resolver with the default configuration did work.


Solution

  • You'll need to use a "real" downing provider like the Split Brain Resolver. This lets the cluster safely down nodes that are unreachable. (As opposed to auto-downing, which downs them without any consideration of whether it is safe or not.)

    There's a separate question of why DC/OS is killing the nodes so quickly that they don't get the chance to shut down properly. I don't know DC/OS well enough to say why that could be. Regardless, a downing provider is essential for clustered environments, so you will want to get that in place anyway.
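
    One guess worth checking (I can't confirm this is what DC/OS is doing): Akka's CoordinatedShutdown is run from a JVM shutdown hook by default, so a node that receives SIGTERM will try to leave the cluster gracefully if it gets enough time before SIGKILL. In Marathon that window is controlled by taskKillGracePeriodSeconds in the app definition, so raising it might let the old nodes leave cleanly. The value below is illustrative:

        {
          "id": "/app",
          "taskKillGracePeriodSeconds": 30
        }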

    Edited due to your comments about SBR:

    As an example of what might be happening: Marathon kills an old node before it has a chance to leave the cluster, the remaining nodes mark it as unreachable, and without a downing provider nothing ever removes it, which is why the leader keeps logging that it cannot perform its duties.

    So, I would recommend the following: SBR is the answer here. I realize that you aren't having real network partitions, but the fact that you are seeing unreachable nodes means that Akka Cluster is unable to tell whether there is a network partition or not, and that's the root cause of the problem.
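
    If the default configuration still downs more of the cluster than you expect, the main knobs to look at are stable-after and down-all-when-unstable (a sketch; the values here are illustrative, not tuned recommendations):

        akka.cluster.split-brain-resolver {
          # keep-majority is the default strategy in Akka 2.6
          active-strategy = keep-majority
          # Membership must be stable this long before SBR takes a decision;
          # raise it if deployments replace nodes over a longer window
          stable-after = 30s
          # Down all nodes if membership keeps churning for too long
          # ("on" derives the duration from stable-after)
          down-all-when-unstable = on
        }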