Vertex Pipeline: CustomPythonPackageTrainingJobRunOp not supplying WorkerPoolSpecs


I am trying to run a custom Python package training pipeline using Kubeflow Pipelines on Vertex AI. My training code is packaged and stored in Google Cloud Storage, and my pipeline is:

import kfp
from kfp.v2 import compiler
from kfp.v2.dsl import component
from kfp.v2.google import experimental
from google.cloud import aiplatform
from google_cloud_pipeline_components import aiplatform as gcc_aip

@kfp.dsl.pipeline(name=pipeline_name, pipeline_root=pipeline_root_path)
def pipeline():
    # Train from the Python package stored in GCS
    training_job_run_op = gcc_aip.CustomPythonPackageTrainingJobRunOp(
        project=project_id,
        display_name=training_job_name,
        model_display_name=model_display_name,
        python_package_gcs_uri=python_package_gcs_uri,
        python_module=python_module,
        container_uri=container_uri,
        staging_bucket=staging_bucket,
        model_serving_container_image_uri=model_serving_container_image_uri)

    # Upload model
    model_upload_op = gcc_aip.ModelUploadOp(
        project=project_id,
        display_name=model_display_name,
        artifact_uri=output_dir,
        serving_container_image_uri=model_serving_container_image_uri,
    )
    model_upload_op.after(training_job_run_op)

    # Deploy model
    model_deploy_op = gcc_aip.ModelDeployOp(
        project=project_id,
        model=model_upload_op.outputs["model"],
        endpoint=aiplatform.Endpoint(
            endpoint_name='0000000000').resource_name,
        deployed_model_display_name=model_display_name,
        machine_type="n1-standard-2",
        traffic_percentage=100)

# Compile the pipeline function into a pipeline spec file
compiler.Compiler().compile(pipeline_func=pipeline,
                            package_path=pipeline_spec_path)

When I run this pipeline on Vertex AI, I get the following error:

{
  "insertId": "qd9wxrfnoviyr",
  "jsonPayload": {
    "levelname": "ERROR",
    "message": "google.api_core.exceptions.InvalidArgument: 400 List of found errors:\t1.Field: job_spec.worker_pool_specs; Message: At least one worker pool should be specified.\t\n"
  }
}
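
When the message is buried in a log entry like the one above, the failing field can be pulled out programmatically. This is only a minimal sketch: it assumes the message keeps the `Field: ...; Message: ...` layout seen in this particular entry, which is not a documented format, and `extract_field_errors` is a hypothetical helper name.

```python
import json
import re

# Log entry copied from the error above (raw string so the JSON escapes survive).
LOG_ENTRY = r'''
{
  "insertId": "qd9wxrfnoviyr",
  "jsonPayload": {
    "levelname": "ERROR",
    "message": "google.api_core.exceptions.InvalidArgument: 400 List of found errors:\t1.Field: job_spec.worker_pool_specs; Message: At least one worker pool should be specified.\t\n"
  }
}
'''

# Assumed layout: "<n>.Field: <path>; Message: <text>" entries separated by tabs.
FIELD_ERROR = re.compile(r"Field: (?P<field>[^;]+); Message: (?P<message>[^\t]+)")


def extract_field_errors(log_entry_json):
    """Return (field, message) pairs found in a log entry's error message."""
    message = json.loads(log_entry_json)["jsonPayload"]["message"]
    return [
        (match.group("field"), match.group("message").strip())
        for match in FIELD_ERROR.finditer(message)
    ]


errors = extract_field_errors(LOG_ENTRY)
for field, text in errors:
    print(f"{field}: {text}")
```

Here the only reported field is `job_spec.worker_pool_specs`, which points at the training job's worker pool configuration rather than at the upload or deploy steps.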

Solution

  • My original CustomPythonPackageTrainingJobRunOp call did not define any worker pool, so the job was submitted with an empty job_spec.worker_pool_specs, which caused the error. After I specified replica_count and machine_type, the error was resolved. The final training op is:

    training_job_run_op = gcc_aip.CustomPythonPackageTrainingJobRunOp(
        project=project_id,
        display_name=training_job_name,
        model_display_name=model_display_name,
        python_package_gcs_uri=python_package_gcs_uri,
        python_module=python_module,
        container_uri=container_uri,
        staging_bucket=staging_bucket,
        base_output_dir=output_dir,
        model_serving_container_image_uri=model_serving_container_image_uri,
        replica_count=1,
        machine_type="n1-standard-4")
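
To make the fix more concrete, here is a rough sketch of the kind of worker pool spec that replica_count and machine_type end up describing. The field names follow the Vertex AI CustomJob `WorkerPoolSpec` resource as I understand it; how the component maps its arguments internally is an assumption on my part, and `build_worker_pool_specs`, the bucket path, the image URI, and the module name are all hypothetical placeholders.

```python
def build_worker_pool_specs(python_package_gcs_uri, python_module, container_uri,
                            machine_type="n1-standard-4", replica_count=1):
    """Build a single-pool worker_pool_specs list for a Python package job.

    Hypothetical helper for illustration only; it mirrors the shape of the
    CustomJob WorkerPoolSpec resource and calls no Google Cloud API.
    """
    if replica_count < 1:
        # With no replicas there is no usable worker pool, which is the
        # situation the "At least one worker pool" error complains about.
        raise ValueError("replica_count must be at least 1")
    return [{
        "machine_spec": {"machine_type": machine_type},
        "replica_count": replica_count,
        "python_package_spec": {
            "executor_image_uri": container_uri,
            "package_uris": [python_package_gcs_uri],
            "python_module": python_module,
        },
    }]


# Placeholder values, not taken from the original pipeline.
specs = build_worker_pool_specs(
    python_package_gcs_uri="gs://example-bucket/trainer-0.1.tar.gz",
    python_module="trainer.task",
    container_uri="gcr.io/example-project/example-training-image",
)
print(specs[0]["machine_spec"], specs[0]["replica_count"])
```

Without replica_count and machine_type there is nothing to put in this list, which matches the empty job_spec.worker_pool_specs reported in the error.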