Requirement: We need compute resources capable of handling tasks that run for more than 15 minutes, so we determined that AWS Batch was the best solution.
Problem: We support versioning of our application. However, AWS Batch doesn't support versioning the way AWS Lambda does; instead, it creates a new JobDefinition with a new, incremented revision after each deployment.
Example: here, the revision = 58.
However, AWS Batch automatically marks the previous revision of the JobDefinition as INACTIVE and eligible for deletion.
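You can also observe this behaviour programmatically. The snippet below is only a minimal sketch using the AWS SDK for Java v2 that lists the revisions of the job definition together with their status; the job-definition name "sample" and the default client setup are assumptions for illustration:

import java.util.Comparator;
import software.amazon.awssdk.services.batch.BatchClient;
import software.amazon.awssdk.services.batch.model.DescribeJobDefinitionsRequest;
import software.amazon.awssdk.services.batch.model.JobDefinition;

try (BatchClient batchClient = BatchClient.create()) {
    // Fetch every revision (ACTIVE and INACTIVE) of the "sample" job definition (first page only)
    DescribeJobDefinitionsRequest request = DescribeJobDefinitionsRequest.builder()
            .jobDefinitionName("sample")
            .build();

    batchClient.describeJobDefinitions(request).jobDefinitions().stream()
            .sorted(Comparator.comparing(JobDefinition::revision))
            .forEach(jd -> System.out.printf("revision=%d status=%s%n", jd.revision(), jd.status()));
}

After a few deployments only the latest revision (here, 58) is reported as ACTIVE; all older revisions show up as INACTIVE.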
When we try to submit a job against any of these previous revisions using the AWS SDK, like so:
import software.amazon.awssdk.services.batch.model.ContainerOverrides;
import software.amazon.awssdk.services.batch.model.SubmitJobRequest;
import software.amazon.awssdk.services.batch.model.SubmitJobResponse;

// Submit a job against a specific revision of the "sample" job definition
SubmitJobRequest submitJobRequest =
        SubmitJobRequest.builder()
                .jobName("sample")
                .jobQueue("sample")
                .jobDefinition("sample:" + revision)
                .containerOverrides(
                        ContainerOverrides.builder()
                                .command(
                                        "java",
                                        "-cp",
                                        "app.jar",
                                        "-Dspring.main.web-application-type=none",
                                        "-Dloader.main=com.BatchJob",
                                        "org.springframework.boot.loader.PropertiesLauncher")
                                .build())
                .build();

SubmitJobResponse submitJobResponse = batchClient.submitJob(submitJobRequest);
Exception:
JobDefinition arn:aws:batch:eu-west-1:123456789123:job-definition/sample:57 is not in ACTIVE status. (Service: Batch, Status Code: 400, Request ID: bd6199a1-3978-458c-a84e-19de4c1acb62)
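For completeness, this failure surfaces as a BatchException in the SDK. The following is only a sketch of how it could be detected; the handling itself is an assumption and not part of our code:

import software.amazon.awssdk.services.batch.model.BatchException;

try {
    batchClient.submitJob(submitJobRequest);
} catch (BatchException e) {
    // Deregistered (INACTIVE) revisions are rejected with HTTP 400
    if (e.statusCode() == 400 && e.awsErrorDetails().errorMessage().contains("not in ACTIVE status")) {
        System.err.println("Revision " + revision + " has been deregistered: "
                + e.awsErrorDetails().errorMessage());
        // Referencing the job definition by name only ("sample") would resolve to the
        // latest ACTIVE revision, but that defeats the point of pinning a version.
    }
    throw e;
}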
Answers already read: I've read the answers below on Stack Overflow, but none of them answered my question.
Any help would be much appreciated!
We are using Terraform as Infrastructure as Code (IaC), and my aws_batch_job_definition resource looks like this:
resource "aws_batch_job_definition" "test" {
name = "sample"
type = "container"
platform_capabilities = [
"FARGATE",
]
container_properties = jsonencode({
command = ["echo", "test"]
image = "${var.ecr_url_primary}:${local.version}-sample"
jobRoleArn = aws_iam_role.job_task_role.arn
fargatePlatformConfiguration = {
platformVersion = "LATEST"
}
resourceRequirements = [
{
type = "VCPU"
value = "1"
},
{
type = "MEMORY"
value = "3072"
}
]
executionRoleArn = aws_iam_role.execution_role.arn
environment = [
{
name = "SPRING_PROFILES_ACTIVE"
value = "${tostring(var.env)}"
}
]
})
provider = "aws"
}
Research Summary: If you check the events in CloudTrail, you can easily find DeregisterJobDefinition events being triggered. That is what causes the previous revision of the job definition to go into the INACTIVE state and become eligible for deletion after 90 days. Furthermore, the CloudTrail event contains key information such as the userAgent, which tells you what caused the event and can be used to dig deeper into the problem:
{
  "eventVersion": "1.09",
  ...
  "eventSource": "batch.amazonaws.com",
  "eventName": "DeregisterJobDefinition",
  "userAgent": "APN/1.0 HashiCorp/1.0 Terraform/0.13.7 (+https://www.terraform.io) terraform-provider-aws/5.82.1 (+https://registry.terraform.io/providers/hashicorp/aws) aws-sdk-go-v2/1.32.6 ua/2.1 os/linux lang/go#1.23.3 md/GOOS#linux md/GOARCH#amd64 api/batch#1.49.0",
  ...
}
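If clicking through the CloudTrail console is tedious, the same events can also be pulled with the AWS SDK for Java v2. This is only a sketch; the seven-day window and the default client/region setup are assumptions:

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import software.amazon.awssdk.services.cloudtrail.CloudTrailClient;
import software.amazon.awssdk.services.cloudtrail.model.LookupAttribute;
import software.amazon.awssdk.services.cloudtrail.model.LookupAttributeKey;
import software.amazon.awssdk.services.cloudtrail.model.LookupEventsRequest;

try (CloudTrailClient cloudTrail = CloudTrailClient.create()) {
    LookupEventsRequest request = LookupEventsRequest.builder()
            .lookupAttributes(LookupAttribute.builder()
                    .attributeKey(LookupAttributeKey.EVENT_NAME)
                    .attributeValue("DeregisterJobDefinition")
                    .build())
            .startTime(Instant.now().minus(7, ChronoUnit.DAYS)) // look back one week
            .build();

    // Each event carries the raw CloudTrail JSON, including the userAgent field shown above
    cloudTrail.lookupEvents(request).events()
            .forEach(event -> System.out.printf("%s %s%n%s%n",
                    event.eventTime(), event.username(), event.cloudTrailEvent()));
}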
The fix is to explicitly set deregister_on_new_revision in the aws_batch_job_definition resource block of your Terraform configuration, like below:
resource "aws_batch_job_definition" "test" {
name = "tf_test_batch_job_definition"
type = "container"
..
deregister_on_new_revision = false
}
Description:
deregister_on_new_revision - (Optional) When updating a job definition, a new revision is created. This parameter determines whether the previous revision is deregistered (INACTIVE) or left ACTIVE. Defaults to true.
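One trade-off to keep in mind: with deregister_on_new_revision = false, every old revision stays ACTIVE, so revisions that are genuinely no longer needed have to be deregistered explicitly. A minimal sketch of doing that with the same AWS SDK for Java v2 (the name:revision value is a placeholder):

import software.amazon.awssdk.services.batch.model.DeregisterJobDefinitionRequest;

// Deregister an old revision once no application version depends on it any more
batchClient.deregisterJobDefinition(DeregisterJobDefinitionRequest.builder()
        .jobDefinition("sample:42") // placeholder name:revision
        .build());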