google-cloud-platform google-cloud-build gcp-ai-platform-training

How to submit a GCP AI Platform training job from inside a GCP Cloud Build pipeline?


I have a pretty standard CI pipeline using Cloud Build for my container-based Machine Learning training model:

Now in Machine Learning it is impossible to validate a model without testing it with real data. Normally we add 2 extra checks:

This allows us to catch issues in the model code. In my setup, Cloud Build runs in a build GCP project and the data sits in another GCP project.

Q1: Has anybody managed to use the AI Platform training service in Cloud Build to train on data sitting in another GCP project?

Q2: How can I tell Cloud Build to wait until the AI Platform training job has finished, and then check its status (successful/failed)? Judging from the documentation, the only option seems to be --stream-logs, but that seems suboptimal (with this option I saw some huge delays).


Solution

  • For your first question: when you submit an AI Platform training job, you can specify the email of the service account that the job should run as.

    Make sure that this service account has sufficient permissions in the other project to read the data there.
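
    One way to pass the service account is through the job configuration file. This is a minimal sketch, assuming the trainingInput.serviceAccount field of the AI Platform Training job resource is set in the YAML file given to --config; the helper name write_training_config, the file name config.yaml and the service account email are placeholders made up for illustration.

    ```shell
    # Write a job config that sets trainingInput.serviceAccount.
    # $1: service account email, $2: output path (both are placeholders).
    write_training_config() {
      printf 'trainingInput:\n  serviceAccount: %s\n' "$1" > "$2"
    }

    # Hypothetical service account living in the project that holds the data.
    write_training_config "trainer@my-data-project.iam.gserviceaccount.com" config.yaml

    # Then pass the file at submission time, together with your other parameters:
    #   gcloud ai-platform jobs submit training "${JOB_NAME}" --config config.yaml ...
    ```

    If you paste this into a cloudbuild.yaml step, remember to escape each $ as $$, as in the steps shown in this answer.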

    For your second question, you have two options. You can keep --stream-logs, which blocks until the job completes, and discard its output:

    - name: 'gcr.io/cloud-builders/gcloud'
      entrypoint: 'bash'
      args:
        - -c
        - |
             # --stream-logs blocks until the job completes; the streamed output is discarded
             gcloud ai-platform jobs submit training <your params> --stream-logs >/dev/null 2>/dev/null
    
    

    Or you can write a loop that polls the job status:

    - name: 'gcr.io/cloud-builders/gcloud'
      entrypoint: 'bash'
      args:
        - -c
        - |
            JOB_NAME=<UNIQUE Job NAME>
            gcloud ai-platform jobs submit training $${JOB_NAME} <your params>
            # check the job status every 60 seconds until it reports SUCCEEDED
            # (a failed job never matches, so the loop only ends at the build timeout)
            while [ -z "$$(gcloud ai-platform jobs describe $${JOB_NAME} | grep SUCCEEDED)" ]; do sleep 60; done
    

    My test here is simple, but you can customize the status check to match your requirements.

    Don't forget to set the Cloud Build timeout so that it is long enough to cover the whole training job.
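
    As an illustration of a customized status test, the loop can also stop when the job fails. This is a minimal sketch, assuming the state is read with gcloud's --format='value(state)' output and that FAILED and CANCELLED should make the build step fail; the function name wait_for_training_job is made up for illustration.

    ```shell
    # Poll an AI Platform training job until it reaches a terminal state.
    # Returns 0 on SUCCEEDED and 1 on FAILED or CANCELLED.
    # $1: job name, $2: seconds between polls (defaults to 60).
    wait_for_training_job() {
      job_name="$1"
      interval="${2:-60}"
      while true; do
        # Ask only for the state field of the job resource.
        state="$(gcloud ai-platform jobs describe "${job_name}" --format='value(state)')"
        case "${state}" in
          SUCCEEDED)
            return 0 ;;
          FAILED|CANCELLED)
            echo "Job ${job_name} ended in state ${state}" >&2
            return 1 ;;
          *)
            # QUEUED, PREPARING, RUNNING, ... or an empty result: keep waiting.
            sleep "${interval}" ;;
        esac
      done
    }
    ```

    In the build step, you would call wait_for_training_job with the job name right after the submit command, so that a non-zero return value fails the step. Inside cloudbuild.yaml each $ has to be written as $$.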