google-cloud-automl  gcp-ai-platform-training

Can I specify a timeout for a GCP ai-platform training job?


I recently submitted a training job with a command that looked like:

gcloud ai-platform jobs submit training foo --region us-west2 --master-image-uri us.gcr.io/bar:latest -- baz qux

(more on how this command works here: https://cloud.google.com/ml-engine/docs/training-jobs)

There was a bug in my code which caused the job to keep running rather than terminate. Two weeks and $61 later, I discovered my error and cancelled the job. I want to make sure I don't make that kind of mistake again.

I'm considering using the timeout command within the training container to kill the process if it takes too long (typical runtime is about 2 or 3 hours), but rather than trust the container to kill itself, I would prefer to configure GCP to kill it externally.
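For illustration, the in-container approach I have in mind would just wrap the real training entrypoint with GNU timeout (the "python train.py" command and the 4-hour limit below are placeholders, not my actual entrypoint):

    # Kill the training process if it runs longer than 4 hours;
    # escalate to SIGKILL 60 seconds after SIGTERM if it doesn't exit.
    timeout --signal=TERM --kill-after=60s 4h python train.py baz qux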

Is there a way to achieve this?


Solution

  • As a workaround, you could write a small script that submits the job, sleeps for the maximum runtime you want to allow, and then runs a cancel job command (see the sketch below).

    Since a timeout setting is not available in the AI Platform training service, I took the liberty of opening a Public Issue with a Feature Request to record the lack of this option. You can track the PI's progress here.
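    A minimal sketch of that watchdog script, reusing the job name, region and image from the question; the timestamp suffix on the job name and the 4-hour limit are illustrative, while gcloud ai-platform jobs submit/describe/cancel are the actual commands:

        #!/bin/bash
        # Submit the training job, wait for the maximum allowed runtime,
        # then cancel the job if it is still running.
        JOB=foo_$(date +%Y%m%d_%H%M%S)

        gcloud ai-platform jobs submit training "$JOB" \
          --region us-west2 \
          --master-image-uri us.gcr.io/bar:latest \
          -- baz qux

        sleep 4h   # maximum runtime before the watchdog steps in

        # Cancel only if the job has not already finished on its own.
        STATE=$(gcloud ai-platform jobs describe "$JOB" --format='value(state)')
        if [ "$STATE" = "QUEUED" ] || [ "$STATE" = "PREPARING" ] || [ "$STATE" = "RUNNING" ]; then
          gcloud ai-platform jobs cancel "$JOB"
        fi

    Checking the job state first avoids attempting to cancel a job that has already succeeded or failed, which would otherwise make the script exit with an error.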