I have an ECS cluster backed by EC2 machines in an autoscaling group.
The cluster uses a capacity provider, described in CloudFormation with the following code:
```yaml
CapacityProvider:
  Type: AWS::ECS::CapacityProvider
  Condition: EnableInstanceAutoScaling
  Properties:
    AutoScalingGroupProvider:
      AutoScalingGroupArn: !Ref InstanceAutoScalingGroup
      ManagedScaling:
        MaximumScalingStepSize: 10
        MinimumScalingStepSize: 1
        Status: ENABLED
        TargetCapacity: 100
      ManagedTerminationProtection: ENABLED
```
Notice that both `ManagedScaling` and `ManagedTerminationProtection` are `ENABLED`.
Now, following this, I also set `NewInstancesProtectedFromScaleIn` to `true`:

> If managed termination protection is enabled when you create a capacity provider, the Auto Scaling group and each Amazon EC2 instance in the Auto Scaling group must have instance protection from scale in enabled as well.
It all works fine, but sometimes EC2 instances get stuck inside the ASG. It doesn't happen to all of the instances, only to some, and I have no idea which ones. I don't have any lifecycle hooks. This leads to the ASG filling up with unused resources (i.e. money) until it can no longer scale out, because it has reached its maximum capacity.
Then I also found a post about a similar problem with AWS Batch, where the suggested answer was to disable the ASG scale-in protection.
Any suggestions on how I can diagnose/fix the problem?
*P.S. During this, the ASG will have its desired capacity set to e.g. 1 and will be actively trying to scale in.
Hi, I'm an AWS employee on the team that works on ECS.
This is a known issue caused by the fact that ECS does not immediately release an EC2 instance from termination protection as soon as the instance is no longer running tasks. There is a cooldown delay, after which ECS asynchronously releases the instance from termination protection and allows it to be stopped. Normally this cooldown is beneficial: it prevents you from constantly churning EC2 instances when you could keep an instance around for a couple of minutes and reuse it for the next task that needs to be launched.
However, when tearing down the CloudFormation stack, CloudFormation deletes the `AWS::ECS::Service` and immediately moves on to tearing down the `AWS::ECS::Cluster`, disconnecting the `AWS::AutoScaling::AutoScalingGroup` from ECS management too quickly, before ECS has a chance to asynchronously turn off managed instance protection on the EC2 instances. This leaves some EC2 instances stranded in a state where they are protected from scale-in forever, which then blocks the `AWS::AutoScaling::AutoScalingGroup` from cleaning itself up.
Fortunately, I have an automated solution for you. You can use a custom resource that force-deletes the Auto Scaling group when the stack is torn down, avoiding the issue of protected EC2 instances that can never be cleaned up.
You can find a full reference architecture with instructions here: https://containersonaws.com/pattern/ecs-ec2-capacity-provider-scaling
Or just look at the code here on Github: https://github.com/aws-samples/container-patterns/blob/main/pattern/ecs-ec2-capacity-provider-scaling/files/cluster-capacity-provider.yml#L48-L123
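If you hit this on an existing stack and just need to unstick the ASG once, you can also release the scale-in protection manually. A minimal boto3 sketch (the ASG name is a placeholder, and it removes protection unconditionally, so only run it once the instances are genuinely idle or the stack is being torn down):

```python
def protected_instance_ids(asg_description):
    """Return the instance IDs in an ASG description dict that still
    have scale-in protection enabled."""
    return [
        inst["InstanceId"]
        for inst in asg_description["Instances"]
        if inst.get("ProtectedFromScaleIn")
    ]

def release_protection(asg_name):
    """Remove scale-in protection from every protected instance in the ASG,
    so the ASG can finally scale in / delete itself."""
    # Imported here so the pure helper above is usable without boto3 installed.
    import boto3

    autoscaling = boto3.client("autoscaling")
    groups = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"]
    for group in groups:
        stuck = protected_instance_ids(group)
        if stuck:
            autoscaling.set_instance_protection(
                AutoScalingGroupName=asg_name,
                InstanceIds=stuck,
                ProtectedFromScaleIn=False,
            )

# release_protection("my-stuck-asg")  # placeholder ASG name
```

This is the same effect as unchecking instance scale-in protection in the console, just scripted over every stranded instance at once.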