amazon-web-services amazon-ecs aws-cdk aws-cdk-typescript

Gracefully shut down ECS service before EC2 instance termination


We're facing an issue with an ECS cluster created with AWS CDK. The cluster uses EC2 capacity providers, and we set machineImage in the EC2 launch template to ecs.EcsOptimizedImage.amazonLinux2(ecs.AmiHardwareType.GPU), which resolves to the latest ECS-optimized AMI published in SSM. Several load-balanced ECS services run in the cluster.

The problem arises when we run cdk deploy after a new AMI has become available. The EC2 Auto Scaling group spins up new instances with the updated AMI and starts a rolling update, as expected:

Rolling update initiated. Terminating 2 obsolete instance(s) in batches of 1, while keeping at least 2 instance(s) in service. Pausing for PT8M when new instances are added to the autoscaling group.

After a new instance has been spun up, one of the old instances will be terminated as expected:

Terminating instance(s) [i-xxxxxxx]; replacing with 1 new instance(s).

However, the ECS container instance is not deregistered before its EC2 instance is terminated, so ECS is never told that the tasks running on that instance are about to stop. The tasks of the ECS services on that instance die instantly instead of being shut down gracefully, and the load balancer keeps routing some requests to the terminated tasks until their health checks fail.

We're looking for a way to gracefully shut down the ECS services on an EC2 instance before it gets terminated to avoid serving requests to a terminating instance. Ideally, we want to deregister the container instance from ECS, ensuring that all connections are drained before the EC2 instance is terminated.

Any advice or guidance would be greatly appreciated!


Solution

  • January 2024 update

    ECS now supports managed instance draining, which solves this issue. See: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/managed-instance-draining.html

    Original answer

    I ended up following the advice laid out in this blog post: https://aws.amazon.com/blogs/compute/how-to-automate-container-instance-draining-in-amazon-ecs/.

    I attached a lifecycle hook to the Auto Scaling group, which sends a message to an SNS topic when an instance is about to be terminated. This, in turn, triggers a Lambda function that starts a Step Functions execution. The state machine definition comprises two steps: a single Lambda function task and a "Wait 10 seconds" step. The Lambda function task identifies the container instance running on the EC2 instance that is being shut down and sets its status to DRAINING. It then gets invoked again every 10 seconds until the container instance is either no longer found or has zero running tasks. Once this condition is met, it calls the complete-lifecycle-action API, informing the Auto Scaling group that the instance can now be safely terminated.
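The polling logic of the original answer can be sketched as a small pure function. This is only an illustration of the decision made on each invocation, not the code from the blog post: the ContainerInstanceState type, the DrainAction names, and nextDrainActions are hypothetical. In a real Lambda, SET_DRAINING would presumably map to the ECS UpdateContainerInstancesState call with status DRAINING, and COMPLETE_LIFECYCLE_ACTION to the Auto Scaling CompleteLifecycleAction call.

```typescript
// Minimal, hypothetical view of an ECS container instance; the field names
// loosely follow the ECS DescribeContainerInstances response.
interface ContainerInstanceState {
  status: string;            // e.g. "ACTIVE" or "DRAINING"
  runningTasksCount: number; // tasks still running on the instance
}

// Actions the polling Lambda would perform on one invocation (illustrative names).
type DrainAction = "SET_DRAINING" | "COMPLETE_LIFECYCLE_ACTION" | "WAIT_AND_RETRY";

// Decide what to do for the EC2 instance that is being terminated.
// Pass undefined when no container instance is registered for it any more.
function nextDrainActions(instance: ContainerInstanceState | undefined): DrainAction[] {
  // Container instance no longer found: nothing left to drain.
  if (instance === undefined) {
    return ["COMPLETE_LIFECYCLE_ACTION"];
  }

  const actions: DrainAction[] = [];

  // Mark the instance as DRAINING so ECS stops placing tasks on it
  // and starts moving service tasks elsewhere.
  if (instance.status !== "DRAINING") {
    actions.push("SET_DRAINING");
  }

  // Zero running tasks means the instance can be terminated;
  // otherwise the state machine waits and invokes the function again.
  actions.push(instance.runningTasksCount === 0 ? "COMPLETE_LIFECYCLE_ACTION" : "WAIT_AND_RETRY");
  return actions;
}

// Example: an ACTIVE instance that still runs two tasks is set to DRAINING
// and polled again after the wait.
console.log(nextDrainActions({ status: "ACTIVE", runningTasksCount: 2 }));
```

Keeping the decision separate from the AWS calls makes the loop easy to check locally. The state machine would repeat the "Wait 10 seconds" step for as long as the result contains WAIT_AND_RETRY. The lifecycle hook's heartbeat timeout has to be long enough to cover the draining period, otherwise the Auto Scaling group proceeds with termination on its own.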