[SOLVED] How to make AzureML jobs wait for GPUs to be available on attached Kubernetes compute cluster instead of failing

I am running an AzureML job on an attached Kubernetes compute cluster on a custom instance type with a resource limit of 2 GPUs.

When I trigger the job, only 1 GPU is available because other jobs use the other GPUs. I want the job to be queued and start when a total of 2 GPUs become available, but instead, I can see the following error in the job Tags:

retry-reason-1 : 03/08/2023 10:45:05 +00:00, FailureMsg: PodPattern matched: {"reason":"UnexpectedAdmissionError","message":"Pod Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 2, Available: 1, which is unexpected"}, FailureCode: -1006

The job is retried 10 times and then fails. Is there a way to change this behavior? For example, can I set a maximum waiting time so that the job stays queued for longer instead of failing so quickly?

I trigger the job with the az CLI:

az ml job create -f myjob.yaml

And my job definition looks like this:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
experiment_name: my-experiment

command: |
  python myscript.py
  
code: .
environment: azureml:my-environment:1
compute: azureml:my-onprem-compute
resources:
  instance_type: myinstancetypewith2gpus

Solution

I figured out why this was happening, so I will post it here.

In this attached Kubernetes cluster, some engineers ran workloads through the default Kubernetes scheduler while others submitted AzureML jobs. AzureML schedules its jobs with the Volcano scheduler, so two schedulers were placing GPU workloads on the same nodes.

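The default scheduler had allocated GPUs to its own pods, but the Volcano scheduler somehow did not have an accurate snapshot of the cluster resources. It therefore placed the AzureML job on a node that no longer had 2 free GPUs, and the pod failed at admission instead of waiting in the queue.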
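
To check whether two schedulers are competing for the same GPUs, something like the following can help. It is only a rough sketch: gpus_by_scheduler is a helper name I made up, and it assumes GPUs are exposed as the nvidia.com/gpu resource (as in the error message above) and that GPU pods declare them under resources.limits. It sums the GPUs held by running pods, grouped by the schedulerName recorded in each pod spec.

```shell
# Sketch (assumption: GPUs are requested via resources.limits."nvidia.com/gpu").
# gpus_by_scheduler is a hypothetical helper, not part of kubectl or AzureML.
gpus_by_scheduler() {
  # Emit one line per running pod: "<schedulerName> <gpu limit of each container>".
  # Pods without GPU limits produce only the scheduler name and are skipped by awk.
  kubectl get pods --all-namespaces --field-selector=status.phase=Running \
    -o jsonpath='{range .items[*]}{.spec.schedulerName}{" "}{.spec.containers[*].resources.limits.nvidia\.com/gpu}{"\n"}{end}' |
    awk 'NF >= 2 { for (i = 2; i <= NF; i++) total[$1] += $i }
         END { for (s in total) print s, total[s] }' |
    sort
}
```

Calling gpus_by_scheduler prints one line per scheduler name with the number of GPUs its running pods hold. If more than one scheduler shows up with GPUs, the situation described above is a likely suspect.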

Restarting the scheduler by deleting the volcano-scheduler pod fixed the problem:

kubectl delete pods -n azureml -lapp=volcano-scheduler

Gathering the logs of the Volcano scheduler also helped me understand what was happening:

kubectl logs -n azureml -lapp=volcano-scheduler -f