Tags: google-cloud-platform, google-cloud-ml, gcp-ai-platform-training

How to run multiple GPU-accelerated training jobs concurrently on AI Platform


I'm running TensorFlow training jobs on AI Platform with "scaleTier": "BASIC_GPU". My understanding is that this scale tier provisions a single worker machine with one Tesla K80 GPU for each job.
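For reference, this is roughly how such a job is submitted through the AI Platform Training REST API with the Python client. The project ID, bucket, package, and module names below are placeholders, and the runtime/Python versions are assumptions:

from googleapiclient import discovery

# Build a client for the AI Platform Training (ml.googleapis.com) API.
ml = discovery.build("ml", "v1")
project = "projects/my-project"          # placeholder project ID

job_spec = {
    "jobId": "basic_gpu_job_01",         # must be unique within the project
    "trainingInput": {
        "scaleTier": "BASIC_GPU",        # single worker with one Tesla K80
        "packageUris": ["gs://my-bucket/trainer-0.1.tar.gz"],  # placeholder
        "pythonModule": "trainer.task",  # placeholder entry point
        "region": "us-central1",
        "runtimeVersion": "2.1",         # assumption
        "pythonVersion": "3.7",          # assumption
    },
}

# Each job submitted this way counts against the project's CPU and GPU quota.
ml.projects().jobs().create(parent=project, body=job_spec).execute()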

Creating a new job while another job is already running causes the newly created job to be placed in a queue until the running job finishes. When I check the logs for the new job, I see this message:

This job is number 1 in the queue and requires 8.000000 CPUs and 1 K80 accelerators. The project is using 8.000000 CPUs out of 450 allowed and 1 K80 accelerators out of 0 TPU_V2_POD, 0 TPU_V3_POD, 1 K80, 1 P100, 1 V100, 4 P4, 4 T4, 8 TPU_V2, 8 TPU_V3 allowed across all regions. The project is using 8.000000 CPUs out of 20 allowed and 1 K80 accelerators out of 0 TPU_V2_POD, 0 TPU_V3_POD, 1 K80, 1 P100, 1 P4, 1 T4, 1 V100, 8 TPU_V2, 8 TPU_V3 allowed in the region us-central1.
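The queuing is also visible when listing the project's jobs programmatically. A minimal sketch (the project ID is a placeholder):

from googleapiclient import discovery

ml = discovery.build("ml", "v1")

# List jobs and print their states; the newly created job reports "QUEUED"
# while the job holding the K80 reports "RUNNING".
response = ml.projects().jobs().list(parent="projects/my-project").execute()
for job in response.get("jobs", []):
    print(job["jobId"], job["state"])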

This AI Platform documentation seems to say that my project should be able to use up to 30 K80 GPUs concurrently.

Why is it that I can't even use 2 concurrently?

Do I need to do something to increase my limit to the expected 30?


Solution

  • For new projects, the default quota is very low. As the log message shows, this project is currently allowed only 1 K80 accelerator, which is why a second GPU job has to wait in the queue; the 30 K80s mentioned in the documentation is a maximum, not the default. You can request a quota increase through this form. Once the increase is granted, several BASIC_GPU jobs can run at the same time, as sketched below.
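After the quota increase (the project's current GPU quota can be checked in the Cloud Console under IAM & Admin > Quotas), jobs submitted back to back run in parallel instead of queuing. A minimal sketch, assuming a quota of at least three K80s and the same placeholder training input as above:

from googleapiclient import discovery

ml = discovery.build("ml", "v1")
project = "projects/my-project"          # placeholder project ID

training_input = {
    "scaleTier": "BASIC_GPU",            # one K80 per job
    "packageUris": ["gs://my-bucket/trainer-0.1.tar.gz"],  # placeholder
    "pythonModule": "trainer.task",      # placeholder entry point
    "region": "us-central1",
}

# With a K80 quota of at least 3, these jobs run concurrently
# rather than waiting in the queue one after another.
for i in range(3):
    job_spec = {"jobId": f"k80_job_{i}", "trainingInput": training_input}
    ml.projects().jobs().create(parent=project, body=job_spec).execute()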