amazon-web-services terraform amazon-ecs

AWS ECS creates two instances


I am struggling to configure an ECS cluster that has a desired capacity of 1. I have it configured with a load balancer. These are the main settings of my Terraform file:

resource "aws_autoscaling_group" "public_ecs_asg" {
    name = var.public_ecs.asg.name

    vpc_zone_identifier = [aws_subnet.main_vpc_public_subnet_1.id, aws_subnet.main_vpc_public_subnet_2.id]

    min_size            = var.public_ecs.asg.ec2_min_instances
    max_size            = var.public_ecs.asg.ec2_max_instances
    desired_capacity    = 1
}


resource "aws_ecs_service" "public_ecs_service" {
    name            = var.public_ecs.service_name
    cluster         = aws_ecs_cluster.public_ecs_cluster.id
    task_definition = aws_ecs_task_definition.public_ecs_task_definition.arn
    desired_count   = 1
}

resource "aws_ecs_capacity_provider" "public_ecs_capacity_provider" {
    name = var.public_ecs.capacity_provider_name

    auto_scaling_group_provider {
        auto_scaling_group_arn = aws_autoscaling_group.public_ecs_asg.arn

        managed_scaling {
            maximum_scaling_step_size = 1
            minimum_scaling_step_size = 1
            status                    = "ENABLED"
            target_capacity           = 1
        }
    }
}

When Terraform applies this configuration, it creates an EC2 instance before ECS is created (which I think is the main issue). Then an ASG dynamic scaling policy with a target of 1, as defined previously, is created automatically:

{
  "CustomizedMetricSpecification": {
    "MetricName": "CapacityProviderReservation",
    "Namespace": "AWS/ECS/ManagedScaling",
    "Dimensions": [
      {
        "Name": "CapacityProviderName",
        "Value": "public-ecs-capacity-provider"
      },
      {
        "Name": "ClusterName",
        "Value": "public-ecs-cluster"
      }
    ],
    "Statistic": "Average"
  }
}

This policy is triggered by a CloudWatch alarm, also created by AWS, that states: "Threshold CapacityProviderReservation > 1 for 1 datapoints within 1 minute".

This alarm never changes to OK status; it is always in ALARM.

This is the full Terraform config file: https://pastebin.com/CqKX8VTm

What am I missing?


Solution

  • The likely cause is that the ASG takes time to launch its first instance. When ECS tries to place the task, it finds no healthy instance registered yet, so it requests one more. Depending on how much of each instance the tasks consume and how your ASG scaling metric is configured, the ASG then decides whether to remove one instance or keep both. For example, if your task already consumes 60% of an instance and the ASG only scales in below 50%, both instances stay. The configuration below makes sure the ECS service is created only after at least one instance has registered with the cluster:

    resource "null_resource" "wait_for_asg_instances" {
      # Block until at least one container instance has registered with the cluster.
      # Requires the AWS CLI and valid credentials on the machine running Terraform.
      provisioner "local-exec" {
        command = <<-EOT
          while [ "$(aws ecs list-container-instances --cluster ${aws_ecs_cluster.public_ecs_cluster.name} --query 'containerInstanceArns' --output text)" = "" ]; do
            echo "Waiting for ECS instances to register..."
            sleep 10
          done
        EOT
      }
      depends_on = [aws_autoscaling_group.public_ecs_asg]
    }
    
    resource "aws_ecs_service" "public_ecs_service" {
      name            = var.public_ecs.service_name
      cluster         = aws_ecs_cluster.public_ecs_cluster.id
      task_definition = aws_ecs_task_definition.public_ecs_task_definition.arn
      desired_count   = 1
    
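      capacity_provider_strategy {
        capacity_provider = aws_ecs_capacity_provider.public_ecs_capacity_provider.name
        weight            = 1 # explicit non-zero weight so tasks are placed on this provider
      }

      # Create the service only after an instance has registered.
      depends_on = [null_resource.wait_for_asg_instances]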
    }
    

    More generally, the issue stems from a lack of synchronization between the Auto Scaling Group (ASG) and the ECS service. The ECS service relies on the ASG to provide enough capacity, and if the ASG's scaling rules or metrics are misaligned with what the service needs, extra instances can appear. In addition, if an ECS task takes longer than expected to start, the ECS service might treat the delay as a failure and launch a second task, leaving both tasks running at the same time.
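
On the point about misaligned scaling metrics, one setting that may be worth checking is `target_capacity` in the question's `managed_scaling` block. As AWS documents it, `CapacityProviderReservation` is the number of instances the tasks need divided by the number of instances running, times 100, and `target_capacity` is the percentage that managed scaling tries to hold. With one instance needed and one running, the metric is 100, which is always above a target of 1; that would match an alarm that never leaves ALARM. The fragment below is a sketch of the same resource with a target of 100, assuming the goal is a single, fully used instance. The value 100 is an assumption, and the question does not confirm that this alone removes the second instance.

```hcl
# Sketch only: the capacity provider from the question with target_capacity raised.
resource "aws_ecs_capacity_provider" "public_ecs_capacity_provider" {
  name = var.public_ecs.capacity_provider_name

  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.public_ecs_asg.arn

    managed_scaling {
      maximum_scaling_step_size = 1
      minimum_scaling_step_size = 1
      status                    = "ENABLED"
      # Percentage of instance capacity that managed scaling tries to keep in use.
      # 100 (assumed here) means no spare instances beyond what the tasks need.
      target_capacity           = 100
    }
  }
}
```

With this value the generated alarm threshold would be `CapacityProviderReservation > 100`, so one running instance serving one task sits at the target instead of above it.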