Tags: google-cloud-platform, google-cloud-ml, gcp-ai-platform-training

How to run multiple GPU-accelerated training jobs concurrently on AI Platform


I'm running TensorFlow training jobs on AI Platform with "scaleTier": "BASIC_GPU". My understanding is that this scale tier provisions a single worker machine with one Tesla K80 GPU for each job.
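For reference, this is roughly how such a job is submitted through the AI Platform Training REST API with the Python client. The project ID, bucket, package, and module names below are placeholders, and the runtime/Python versions are assumptions:

from googleapiclient import discovery

# Build a client for the AI Platform Training (ml.googleapis.com) API.
ml = discovery.build("ml", "v1")
project = "projects/my-project"          # placeholder project ID

job_spec = {
    "jobId": "basic_gpu_job_01",         # must be unique within the project
    "trainingInput": {
        "scaleTier": "BASIC_GPU",        # single worker with one Tesla K80
        "packageUris": ["gs://my-bucket/trainer-0.1.tar.gz"],  # placeholder
        "pythonModule": "trainer.task",  # placeholder entry point
        "region": "us-central1",
        "runtimeVersion": "2.1",         # assumption
        "pythonVersion": "3.7",          # assumption
    },
}

# Each job submitted this way counts against the project's CPU and GPU quota.
ml.projects().jobs().create(parent=project, body=job_spec).execute()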

Creating a new job while another job is already running causes the newly created job to be placed in a queue until the running job finishes. When I check the logs for the new job, I see this message:

This job is number 1 in the queue and requires 8.000000 CPUs and 1 K80 accelerators. The project is using 8.000000 CPUs out of 450 allowed and 1 K80 accelerators out of 0 TPU_V2_POD, 0 TPU_V3_POD, 1 K80, 1 P100, 1 V100, 4 P4, 4 T4, 8 TPU_V2, 8 TPU_V3 allowed across all regions. The project is using 8.000000 CPUs out of 20 allowed and 1 K80 accelerators out of 0 TPU_V2_POD, 0 TPU_V3_POD, 1 K80, 1 P100, 1 P4, 1 T4, 1 V100, 8 TPU_V2, 8 TPU_V3 allowed in the region us-central1.
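The queuing is also visible when listing the project's jobs programmatically. A minimal sketch (the project ID is a placeholder):

from googleapiclient import discovery

ml = discovery.build("ml", "v1")

# List jobs and print their states; the newly created job reports "QUEUED"
# while the job holding the K80 reports "RUNNING".
response = ml.projects().jobs().list(parent="projects/my-project").execute()
for job in response.get("jobs", []):
    print(job["jobId"], job["state"])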

This AI Platform documentation seems to say that my project should be able to use up to 30 K80 GPUs concurrently.

Why is it that I can't even use 2 concurrently?

Do I need to do something to increase my limit to the expected 30?


Solution

  • For new projects, the default quota is very low. As the log message shows, this project is currently allowed only 1 K80 accelerator, which is why a second GPU job has to wait in the queue; the 30 K80s mentioned in the documentation is a maximum, not the default. You can request a quota increase through this form. Once the increase is granted, several BASIC_GPU jobs can run at the same time, as sketched below.
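After the quota increase (the project's current GPU quota can be checked in the Cloud Console under IAM & Admin > Quotas), jobs submitted back to back run in parallel instead of queuing. A minimal sketch, assuming a quota of at least three K80s and the same placeholder training input as above:

from googleapiclient import discovery

ml = discovery.build("ml", "v1")
project = "projects/my-project"          # placeholder project ID

training_input = {
    "scaleTier": "BASIC_GPU",            # one K80 per job
    "packageUris": ["gs://my-bucket/trainer-0.1.tar.gz"],  # placeholder
    "pythonModule": "trainer.task",      # placeholder entry point
    "region": "us-central1",
}

# With a K80 quota of at least 3, these jobs run concurrently
# rather than waiting in the queue one after another.
for i in range(3):
    job_spec = {"jobId": f"k80_job_{i}", "trainingInput": training_input}
    ml.projects().jobs().create(parent=project, body=job_spec).execute()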