I need to set up GPU-backed instances on AWS Batch. Here's my CloudFormation .yaml template:
GPULargeLaunchTemplate:
  Type: AWS::EC2::LaunchTemplate
  Properties:
    LaunchTemplateData:
      UserData:
        Fn::Base64:
          Fn::Sub: |
            MIME-Version: 1.0
            Content-Type: multipart/mixed; boundary="==BOUNDARY=="

            --==BOUNDARY==
            Content-Type: text/cloud-config; charset="us-ascii"

            runcmd:
              - yum install -y aws-cfn-bootstrap
              - echo ECS_LOGLEVEL=debug >> /etc/ecs/ecs.config
              - echo ECS_IMAGE_CLEANUP_INTERVAL=60m >> /etc/ecs/ecs.config
              - echo ECS_IMAGE_MINIMUM_CLEANUP_AGE=60m >> /etc/ecs/ecs.config
              - /opt/aws/bin/cfn-init -v --region us-west-2 --stack cool_stack --resource LaunchConfiguration
              - echo "DEVS=/dev/xvda" > /etc/sysconfig/docker-storage-setup
              - echo "VG=docker" >> /etc/sysconfig/docker-storage-setup
              - echo "DATA_SIZE=99%FREE" >> /etc/sysconfig/docker-storage-setup
              - echo "AUTO_EXTEND_POOL=yes" >> /etc/sysconfig/docker-storage-setup
              - echo "LV_ERROR_WHEN_FULL=yes" >> /etc/sysconfig/docker-storage-setup
              - echo "EXTRA_STORAGE_OPTIONS=\"--storage-opt dm.fs=ext4 --storage-opt dm.basesize=64G\"" >> /etc/sysconfig/docker-storage-setup
              - /usr/bin/docker-storage-setup
              - yum update -y
              - echo "OPTIONS=\"--default-ulimit nofile=1024000:1024000 --storage-opt dm.basesize=64G\"" >> /etc/sysconfig/docker
              - /etc/init.d/docker restart
            --==BOUNDARY==--
    LaunchTemplateName: GPULargeLaunchTemplate

GPULargeBatchComputeEnvironment:
  DependsOn:
    - ComputeRole
    - ComputeInstanceProfile
  Type: AWS::Batch::ComputeEnvironment
  Properties:
    Type: MANAGED
    ComputeResources:
      ImageId: ami-GPU-optimized-AMI-ID
      AllocationStrategy: BEST_FIT_PROGRESSIVE
      LaunchTemplate:
        LaunchTemplateId:
          Ref: GPULargeLaunchTemplate
        Version:
          Fn::GetAtt:
            - GPULargeLaunchTemplate
            - LatestVersionNumber
      InstanceRole:
        Ref: ComputeInstanceProfile
      InstanceTypes:
        - g4dn.xlarge
      MaxvCpus: 768
      MinvCpus: 1
      SecurityGroupIds:
        - Fn::GetAtt:
            - ComputeSecurityGroup
            - GroupId
      Subnets:
        - Ref: ComputePrivateSubnetA
      Type: EC2
      UpdateToLatestImageVersion: True

MyGPUBatchJobQueue:
  Type: AWS::Batch::JobQueue
  Properties:
    ComputeEnvironmentOrder:
      - ComputeEnvironment:
          Ref: GPULargeBatchComputeEnvironment
        Order: 1
    Priority: 5
    JobQueueName: MyGPUBatchJobQueue
    State: ENABLED

MyGPUJobDefinition:
  Type: AWS::Batch::JobDefinition
  Properties:
    Type: container
    ContainerProperties:
      Command:
        - "/opt/bin/python3"
        - "/opt/bin/start.py"
        - "--retry_count"
        - "Ref::batchRetryCount"
        - "--retry_limit"
        - "Ref::batchRetryLimit"
      Environment:
        - Name: "Region"
          Value: "us-west-2"
        - Name: "LANG"
          Value: "en_US.UTF-8"
      Image:
        Fn::Sub: "cool_1234_abc.dkr.ecr.us-west-2.amazonaws.com/my-image"
      JobRoleArn:
        Fn::Sub: "arn:aws:iam::cool_1234_abc:role/ComputeRole"
      Memory: 16000
      Vcpus: 1
      ResourceRequirements:
        - Type: GPU
          Value: '1'
    JobDefinitionName: MyGPUJobDefinition
    Timeout:
      AttemptDurationSeconds: 500
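For reference, a job against this queue would be submitted roughly like below; the job name and parameter values are placeholders I made up, and --parameters is what fills the Ref::batchRetryCount / Ref::batchRetryLimit placeholders in the job definition's Command:

```bash
# Hypothetical submission; substitute your own job name and parameter values.
aws batch submit-job \
  --region us-west-2 \
  --job-name my-gpu-test-job \
  --job-queue MyGPUBatchJobQueue \
  --job-definition MyGPUJobDefinition \
  --parameters batchRetryCount=0,batchRetryLimit=3
```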
When I start a job, the job is stuck in the RUNNABLE state forever. Here's what I tried:

1. Replaced the ImageId field in my ComputeEnvironment with a known GPU-optimized AMI, but still no luck;
2. Compared the output of aws batch describe-jobs --jobs AWS_BATCH_JOB_EXECUTION_ID --region us-west-2 for this job against a job from a working (non-GPU) compute environment (a query sketch is just below this question). What's missing between them is containerInstanceArn and taskArn: in the non-working GPU job these two fields are just missing.

Any ideas how to fix this would be greatly appreciated!
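For reference, this is roughly how I'd pull just those two fields out of describe-jobs for a side-by-side comparison; the --query expression is only an illustration and the job ID is a placeholder:

```bash
# Run this for both a working job and the stuck GPU job;
# on the stuck one, containerInstanceArn and taskArn come back missing/null.
aws batch describe-jobs \
  --region us-west-2 \
  --jobs AWS_BATCH_JOB_EXECUTION_ID \
  --query 'jobs[0].container.{containerInstanceArn: containerInstanceArn, taskArn: taskArn}'
```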
This was for sure a great learning. Here's what I did, what I found, and how I resolved the issue:

1. Ran the AWSSupport-TroubleshootAWSBatchJob runbook, which turned out to be helpful (make sure you choose the right region before running it);
2. Checked the ECS agent status and logs on the GPU instance (a sketch of where to look is right after this excerpt) and found the root cause, an NVIDIA driver/library version mismatch:

2024-03-30T01:19:48Z msg="Nvidia GPU Manager: setup failed: error initializing nvidia nvml: nvml: Driver/library version mismatch"
Mar 30 01:19:48 ip-10-0-163-202.us-west-2.compute.internal systemd[1]: ecs.service: control process exited, code=exited status=255
Mar 30 01:19:48 ip-10-0-163-202.us-west-2.compute.internal kernel: NVRM: API mismatch: the client has the version 535.161.07, but
NVRM: this kernel module has the version 470.182.03. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
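A sketch of where these logs live, assuming you can SSH or SSM into the GPU instance; the unit name and file paths are the standard ones on the ECS-optimized Amazon Linux 2 AMI and may differ on other images:

```bash
# ECS agent service status and recent journal entries
systemctl status ecs
sudo journalctl -u ecs --no-pager | tail -n 50

# ECS agent / ecs-init log files (the exact file names can vary by agent version)
sudo tail -n 50 /var/log/ecs/ecs-agent.log*
sudo tail -n 50 /var/log/ecs/ecs-init.log

# Kernel messages about the NVIDIA driver (the NVRM "API mismatch" lines above show up here)
sudo dmesg | grep -i nvrm
```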
3. The fix: switch to the right ECS GPU-optimized AMI, i.e. set

ImageId: ami-019d947e77874eaee

in my template and redeploy (see the end of this answer for a way to look up the current GPU-optimized AMI for your region). Then you can use a few commands to check the status of your GPU EC2 instance:
- systemctl status ecs should show the ECS agent up and running, so that your GPU instance can join your ECS cluster;
- sudo docker info should return info showing that Docker is running;
- nvidia-smi should return info showing that your NVIDIA driver is properly installed and running. Example output:

Sat Mar 30 13:47:46 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 20C P8 9W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
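As mentioned in step 3, rather than hard-coding my AMI ID, you can look up the current ECS GPU-optimized AMI for your own region. To my knowledge AWS publishes it as a public SSM parameter (double-check the path against the ECS documentation):

```bash
# Latest ECS GPU-optimized Amazon Linux 2 AMI ID for the given region
aws ssm get-parameters \
  --region us-west-2 \
  --names /aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended/image_id \
  --query 'Parameters[0].Value' \
  --output text
```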