google-cloud-platform google-cloud-ml gcp-ai-platform-training

Is there a way to be notified of status changes in Google AI Platform training jobs without polling the REST API?


Right now I monitor my submitted jobs on Google AI Platform (formerly ML Engine) by polling the job REST API. I don't like this solution for a few reasons:

  1. Status changes are noticed late, or missed altogether when the interval between them is shorter than the polling interval
  2. Lots of unnecessary network traffic
  3. Lots of unnecessary function invocations

I would like to be notified as soon as my training jobs complete. It'd be great if there is some way to assign hooks or callbacks to run when the job status changes.

I've also considered adding calls to Cloud Functions directly within the training task Python package that runs on AI Platform. However, I don't think those function calls will occur in cases where the training job is shut down unexpectedly, such as when a job is cancelled or forced to end by GCP.

Is there a better way to go about this?


Solution

  • You can use a Stackdriver sink to read the logs and send them to Pub/Sub. From Pub/Sub, you can connect to a bunch of other providers:

    1. Set up a Pub/Sub sink

    Make sure you have access to the logs and publish rights to the topic you want to use before you get started. Follow the instructions for setting up a Stackdriver -> Pub/Sub sink. You’ll want to use this query to limit the events to training jobs:

    resource.type = "ml_job"
    resource.labels.task_name = "service"
    

    Note that you can narrow the query further. For example, you can limit it to a particular job by adding a condition like resource.labels.job_id = "..." or to a certain event with a filter like jsonPayload.message : "..."
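As a rough sketch of step 1, the filter above can be assembled in Python and handed to the google-cloud-logging client to create the sink. The helper names (build_ml_job_filter, create_pubsub_sink) and all argument values are made up for illustration, and the sink creation assumes the google-cloud-logging package is installed and that credentials with the required permissions are available.

```python
def build_ml_job_filter(job_id=None, message_substring=None):
    """Build a log filter for AI Platform training job events.

    The first two clauses come from the answer above; the optional ones
    narrow the filter to a single job or to messages containing a substring.
    """
    clauses = [
        'resource.type = "ml_job"',
        'resource.labels.task_name = "service"',
    ]
    if job_id:
        clauses.append(f'resource.labels.job_id = "{job_id}"')
    if message_substring:
        clauses.append(f'jsonPayload.message : "{message_substring}"')
    return " AND ".join(clauses)


def create_pubsub_sink(project_id, topic_id, sink_name, log_filter):
    """Create a log sink that forwards matching entries to a Pub/Sub topic.

    Assumption: the google-cloud-logging package is installed. The import is
    kept local so the filter helper works without it.
    """
    from google.cloud import logging

    client = logging.Client(project=project_id)
    destination = f"pubsub.googleapis.com/projects/{project_id}/topics/{topic_id}"
    sink = client.sink(sink_name, filter_=log_filter, destination=destination)
    sink.create()
    return sink


if __name__ == "__main__":
    # Print the filter only; creating the sink needs real project values.
    print(build_ml_job_filter(job_id="my_training_job"))
```

The filter string can also be pasted directly into the sink setup page; the sink's writer identity still needs publish rights on the topic, as noted above.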

    2. Respond to the Pub/Sub message

    In order to tell what changed, the recipient of the Pub/Sub message can either query the job status from the ml.googleapis.com API or read the text of the message

    Reading state from ml.googleapis.com

    When you receive the message, make a call to https://ml.googleapis.com/v1/projects/<project_id>/jobs/<job_id> to get the Job information, replacing <project_id> and <job_id> in the URL with the values of resource.labels.project_id and resource.labels.job_id from the Pub/Sub message, respectively.

    The returned Job object contains a field state that, naturally, tells the status of the job.
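A minimal sketch of this approach, assuming the Pub/Sub message data is the base64-encoded JSON log entry that a sink exports. The function names are hypothetical, and obtaining the OAuth access token is left to the caller. Note that the REST path for a job includes a projects/ segment before the project ID.

```python
import base64
import json
import urllib.request

API_ROOT = "https://ml.googleapis.com/v1"


def decode_log_entry(pubsub_data_b64):
    """Decode the base64 Pub/Sub payload into the exported log entry (a dict)."""
    return json.loads(base64.b64decode(pubsub_data_b64).decode("utf-8"))


def job_url(log_entry):
    """Build the jobs.get URL from the resource labels of the log entry."""
    labels = log_entry["resource"]["labels"]
    return f"{API_ROOT}/projects/{labels['project_id']}/jobs/{labels['job_id']}"


def fetch_job_state(url, access_token):
    """Call the API and return the Job's state field.

    Assumption: access_token is a valid OAuth 2.0 token with permission
    to read the job; how it is obtained is outside this sketch.
    """
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {access_token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["state"]
```

In a Pub/Sub-triggered Cloud Function, the handler would pass the event's data field to decode_log_entry, build the URL with job_url, and then call fetch_job_state.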

    Reading state from the message text

    The Pub/Sub message will contain a string telling what happened to the job. You probably want to trigger some behavior when the job ends. Look for these strings in jsonPayload.message: