I have an ECS cluster backed by EC2 machines in an autoscaling group.
The cluster uses a capacity provider, described in CloudFormation with the following code:
```yaml
CapacityProvider:
  Type: AWS::ECS::CapacityProvider
  Condition: EnableInstanceAutoScaling
  Properties:
    AutoScalingGroupProvider:
      AutoScalingGroupArn: !Ref InstanceAutoScalingGroup
      ManagedScaling:
        MaximumScalingStepSize: 10
        MinimumScalingStepSize: 1
        Status: ENABLED
        TargetCapacity: 100
      ManagedTerminationProtection: ENABLED
```
Notice that both `ManagedScaling` and `ManagedTerminationProtection` are `ENABLED`.
Now, following this, I also set `NewInstancesProtectedFromScaleIn` to `true`:

> If managed termination protection is enabled when you create a capacity provider, the Auto Scaling group and each Amazon EC2 instance in the Auto Scaling group must have instance protection from scale in enabled as well.
It all works fine, but sometimes EC2 instances get stuck inside the ASG. It doesn't happen to all of the instances, only to some, and I have no idea which ones. I don't have any lifecycle hooks. This leads to the ASG filling up with unused resources (i.e. money) until it can no longer scale out, because it has reached its maximum capacity.
Then I also found a post about a similar problem with AWS Batch, where the suggested answer was to disable the ASG scale-in protection.
Any suggestions on how I can diagnose/fix the problem?
*P.S. During this, the ASG will have its desired capacity set to e.g. 1 and will be actively trying to scale in.
Hi, I'm an AWS employee on the team that works on ECS.
This is a known issue caused by the fact that ECS does not immediately release an EC2 instance from termination protection as soon as the instance is no longer running tasks. There is a cooldown delay, after which ECS asynchronously releases the instance from termination protection and allows it to be stopped. Normally this cooldown is beneficial: it prevents you from constantly churning EC2 instances when you could keep an instance around for a couple of minutes and reuse it for the next task that needs to be launched.
However, when tearing down the CloudFormation stack, CloudFormation deletes the `AWS::ECS::Service` and immediately moves on to tearing down the `AWS::ECS::Cluster`, disconnecting the `AWS::AutoScaling::AutoScalingGroup` from ECS management too quickly, before ECS has a chance to asynchronously turn off managed instance protection on the EC2 instances. This leaves some EC2 instances stranded in a state where they are protected from scale-in forever, which then blocks the `AWS::AutoScaling::AutoScalingGroup` from cleaning itself up.
Fortunately, I have an automated solution for you. You can use a custom resource that force-deletes the Auto Scaling group when the stack is torn down, avoiding the issue of protected EC2 instances that can never be cleaned up.
You can find a full reference architecture with instructions here: https://containersonaws.com/pattern/ecs-ec2-capacity-provider-scaling
Or just look at the code here on Github: https://github.com/aws-samples/container-patterns/blob/main/pattern/ecs-ec2-capacity-provider-scaling/files/cluster-capacity-provider.yml#L48-L123
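If you hit this on an existing stack and just need to unstick the ASG once, you can also release the scale-in protection manually. A minimal boto3 sketch (the ASG name is a placeholder, and it removes protection unconditionally, so only run it once the instances are genuinely idle or the stack is being torn down):

```python
def protected_instance_ids(asg_description):
    """Return the instance IDs in an ASG description dict that still
    have scale-in protection enabled."""
    return [
        inst["InstanceId"]
        for inst in asg_description["Instances"]
        if inst.get("ProtectedFromScaleIn")
    ]

def release_protection(asg_name):
    """Remove scale-in protection from every protected instance in the ASG,
    so the ASG can finally scale in / delete itself."""
    # Imported here so the pure helper above is usable without boto3 installed.
    import boto3

    autoscaling = boto3.client("autoscaling")
    groups = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"]
    for group in groups:
        stuck = protected_instance_ids(group)
        if stuck:
            autoscaling.set_instance_protection(
                AutoScalingGroupName=asg_name,
                InstanceIds=stuck,
                ProtectedFromScaleIn=False,
            )

# release_protection("my-stuck-asg")  # placeholder ASG name
```

This is the same effect as unchecking instance scale-in protection in the console, just scripted over every stranded instance at once.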