Requirement: We need compute resources capable of handling tasks that run for more than 15 minutes, so we determined that AWS Batch was the best solution.
Problem: We support versioning of our application. However, AWS Batch doesn't support versioning the way AWS Lambda does; instead, it creates a new JobDefinition with a new, incremented revision after each deployment.
Example: here, the revision = 58.
However, AWS Batch automatically marks the previous revision of the JobDefinition as INACTIVE and eligible for deletion.
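You can also observe this behaviour programmatically. The snippet below is only a minimal sketch using the AWS SDK for Java v2 that lists the revisions of the job definition together with their status; the job-definition name "sample" and the default client setup are assumptions for illustration:

import java.util.Comparator;
import software.amazon.awssdk.services.batch.BatchClient;
import software.amazon.awssdk.services.batch.model.DescribeJobDefinitionsRequest;
import software.amazon.awssdk.services.batch.model.JobDefinition;

try (BatchClient batchClient = BatchClient.create()) {
    // Fetch every revision (ACTIVE and INACTIVE) of the "sample" job definition (first page only)
    DescribeJobDefinitionsRequest request = DescribeJobDefinitionsRequest.builder()
            .jobDefinitionName("sample")
            .build();

    batchClient.describeJobDefinitions(request).jobDefinitions().stream()
            .sorted(Comparator.comparing(JobDefinition::revision))
            .forEach(jd -> System.out.printf("revision=%d status=%s%n", jd.revision(), jd.status()));
}

After a few deployments only the latest revision (here, 58) is reported as ACTIVE; all older revisions show up as INACTIVE.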
When we try to submit a job against any of these previous revisions using the AWS SDK, like so:
import software.amazon.awssdk.services.batch.model.ContainerOverrides;
import software.amazon.awssdk.services.batch.model.SubmitJobRequest;
import software.amazon.awssdk.services.batch.model.SubmitJobResponse;

// Submit a job against a specific revision of the "sample" job definition
SubmitJobRequest submitJobRequest =
        SubmitJobRequest.builder()
                .jobName("sample")
                .jobQueue("sample")
                .jobDefinition("sample:" + revision)
                .containerOverrides(
                        ContainerOverrides.builder()
                                .command(
                                        "java",
                                        "-cp",
                                        "app.jar",
                                        "-Dspring.main.web-application-type=none",
                                        "-Dloader.main=com.BatchJob",
                                        "org.springframework.boot.loader.PropertiesLauncher")
                                .build())
                .build();

SubmitJobResponse submitJobResponse = batchClient.submitJob(submitJobRequest);
Exception:
JobDefinition arn:aws:batch:eu-west-1:123456789123:job-definition/sample:57 is not in ACTIVE status. (Service: Batch, Status Code: 400, Request ID: bd6199a1-3978-458c-a84e-19de4c1acb62)
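For completeness, this failure surfaces as a BatchException in the SDK. The following is only a sketch of how it could be detected; the handling itself is an assumption and not part of our code:

import software.amazon.awssdk.services.batch.model.BatchException;

try {
    batchClient.submitJob(submitJobRequest);
} catch (BatchException e) {
    // Deregistered (INACTIVE) revisions are rejected with HTTP 400
    if (e.statusCode() == 400 && e.awsErrorDetails().errorMessage().contains("not in ACTIVE status")) {
        System.err.println("Revision " + revision + " has been deregistered: "
                + e.awsErrorDetails().errorMessage());
        // Referencing the job definition by name only ("sample") would resolve to the
        // latest ACTIVE revision, but that defeats the point of pinning a version.
    }
    throw e;
}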
Answers already read: I've read the answers below on Stack Overflow, but none of them answered my question.
Any help would be much appreciated!
We are using Terraform as Infrastructure as Code (IaC), and my aws_batch_job_definition resource looks like this:
resource "aws_batch_job_definition" "test" {
name = "sample"
type = "container"
platform_capabilities = [
"FARGATE",
]
container_properties = jsonencode({
command = ["echo", "test"]
image = "${var.ecr_url_primary}:${local.version}-sample"
jobRoleArn = aws_iam_role.job_task_role.arn
fargatePlatformConfiguration = {
platformVersion = "LATEST"
}
resourceRequirements = [
{
type = "VCPU"
value = "1"
},
{
type = "MEMORY"
value = "3072"
}
]
executionRoleArn = aws_iam_role.execution_role.arn
environment = [
{
name = "SPRING_PROFILES_ACTIVE"
value = "${tostring(var.env)}"
}
]
})
provider = "aws"
}
Research Summary: If you check the events in CloudTrail, you can easily find DeregisterJobDefinition events being triggered. That is what causes the previous revision of the job definition to go into the INACTIVE state and become eligible for deletion after 90 days. Furthermore, the CloudTrail event contains key information such as the userAgent, which tells you what caused the event and can be used to dig deeper into the problem:
{
  "eventVersion": "1.09",
  ...
  "eventSource": "batch.amazonaws.com",
  "eventName": "DeregisterJobDefinition",
  "userAgent": "APN/1.0 HashiCorp/1.0 Terraform/0.13.7 (+https://www.terraform.io) terraform-provider-aws/5.82.1 (+https://registry.terraform.io/providers/hashicorp/aws) aws-sdk-go-v2/1.32.6 ua/2.1 os/linux lang/go#1.23.3 md/GOOS#linux md/GOARCH#amd64 api/batch#1.49.0",
  ...
}
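If clicking through the CloudTrail console is tedious, the same events can also be pulled with the AWS SDK for Java v2. This is only a sketch; the seven-day window and the default client/region setup are assumptions:

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import software.amazon.awssdk.services.cloudtrail.CloudTrailClient;
import software.amazon.awssdk.services.cloudtrail.model.LookupAttribute;
import software.amazon.awssdk.services.cloudtrail.model.LookupAttributeKey;
import software.amazon.awssdk.services.cloudtrail.model.LookupEventsRequest;

try (CloudTrailClient cloudTrail = CloudTrailClient.create()) {
    LookupEventsRequest request = LookupEventsRequest.builder()
            .lookupAttributes(LookupAttribute.builder()
                    .attributeKey(LookupAttributeKey.EVENT_NAME)
                    .attributeValue("DeregisterJobDefinition")
                    .build())
            .startTime(Instant.now().minus(7, ChronoUnit.DAYS)) // look back one week
            .build();

    // Each event carries the raw CloudTrail JSON, including the userAgent field shown above
    cloudTrail.lookupEvents(request).events()
            .forEach(event -> System.out.printf("%s %s%n%s%n",
                    event.eventTime(), event.username(), event.cloudTrailEvent()));
}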
The fix is to explicitly set deregister_on_new_revision in the aws_batch_job_definition resource block of your Terraform configuration, like below:
resource "aws_batch_job_definition" "test" {
name = "tf_test_batch_job_definition"
type = "container"
..
deregister_on_new_revision = false
}
Description:
deregister_on_new_revision - (Optional) When updating a job definition, a new revision is created. This parameter determines whether the previous revision is deregistered (INACTIVE) or left ACTIVE. Defaults to true.
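One trade-off to keep in mind: with deregister_on_new_revision = false, every old revision stays ACTIVE, so revisions that are genuinely no longer needed have to be deregistered explicitly. A minimal sketch of doing that with the same AWS SDK for Java v2 (the name:revision value is a placeholder):

import software.amazon.awssdk.services.batch.model.DeregisterJobDefinitionRequest;

// Deregister an old revision once no application version depends on it any more
batchClient.deregisterJobDefinition(DeregisterJobDefinitionRequest.builder()
        .jobDefinition("sample:42") // placeholder name:revision
        .build());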