Tags: python, google-cloud-ml, kubeflow-pipelines, google-cloud-vertex-ai

ModelUploadOp step failing with custom prediction container


I am currently trying to deploy a Vertex pipeline that does the following:

  1. Train a custom model (from a custom training Python package) and dump the model artifacts (the trained model and the data preprocessor that will be used at prediction time). This step is working fine, as I can see new resources being created in the storage bucket.

  2. Create a model resource via ModelUploadOp. This step fails when serving_container_environment_variables and serving_container_ports are specified, with the error message shown in the errors section below. This is somewhat surprising, as both are needed by the prediction container, and the environment variables are passed as a dict as specified in the documentation.
    This step works just fine using gcloud commands:

gcloud ai models upload \
    --region us-west1 \
    --display-name session_model_latest \
    --container-image-uri gcr.io/my-project/pred:latest \
    --container-env-vars="MODEL_BUCKET=ml_session_model" \
    --container-health-route=/health \
    --container-predict-route=/predict \
    --container-ports=5000
  1. Create an endpoint.
  2. Deploy the model to the endpoint.

There is clearly something that I am getting wrong with Vertex; the components documentation doesn't help much in this case.

Pipeline

from datetime import datetime

import kfp
from google.cloud import aiplatform
from google_cloud_pipeline_components import aiplatform as gcc_aip
from kfp.v2 import compiler

PIPELINE_ROOT = "gs://ml_model_bucket/pipeline_root"


@kfp.dsl.pipeline(name="session-train-deploy", pipeline_root=PIPELINE_ROOT)
def pipeline():
    training_op = gcc_aip.CustomPythonPackageTrainingJobRunOp(
        project="my-project",
        location="us-west1",
        display_name="train_session_model",
        model_display_name="session_model",
        service_account="name@my-project.iam.gserviceaccount.com",
        environment_variables={"MODEL_BUCKET": "ml_session_model"},
        python_module_name="trainer.train",
        staging_bucket="gs://ml_model_bucket/",
        base_output_dir="gs://ml_model_bucket/",
        args=[
            "--gcs-data-path",
            "gs://ml_model_data/2019-Oct_short.csv",
            "--gcs-model-path",
            "gs://ml_model_bucket/model/model.joblib",
            "--gcs-preproc-path",
            "gs://ml_model_bucket/model/preproc.pkl",
        ],
        container_uri="us-docker.pkg.dev/vertex-ai/training/scikit-learn-cpu.0-23:latest",
        python_package_gcs_uri="gs://ml_model_bucket/trainer-0.0.1.tar.gz",
        model_serving_container_image_uri="gcr.io/my-project/pred",
        model_serving_container_predict_route="/predict",
        model_serving_container_health_route="/health",
        model_serving_container_ports=[5000],
        model_serving_container_environment_variables={
            "MODEL_BUCKET": "ml_model_bucket/model"
        },
    )

    model_upload_op = gcc_aip.ModelUploadOp(
        project="my-project",
        location="us-west1",
        display_name="session_model",
        serving_container_image_uri="gcr.io/my-project/pred:latest",
        # When passing the following 2 arguments this step fails...
        serving_container_environment_variables={"MODEL_BUCKET": "ml_model_bucket/model"},
        serving_container_ports=[5000],
        serving_container_predict_route="/predict",
        serving_container_health_route="/health",
    )
    model_upload_op.after(training_op)

    endpoint_create_op = gcc_aip.EndpointCreateOp(
        project="my-project",
        location="us-west1",
        display_name="pipeline_endpoint",
    )

    model_deploy_op = gcc_aip.ModelDeployOp(
        model=model_upload_op.outputs["model"],
        endpoint=endpoint_create_op.outputs["endpoint"],
        deployed_model_display_name="session_model",
        traffic_split={"0": 100},
        service_account="name@my-project.iam.gserviceaccount.com",
    )
    model_deploy_op.after(endpoint_create_op)


if __name__ == "__main__":
    ts = datetime.now().strftime("%Y%m%d%H%M%S")
    compiler.Compiler().compile(pipeline, "custom_train_pipeline.json")
    pipeline_job = aiplatform.PipelineJob(
        display_name="session_train_and_deploy",
        template_path="custom_train_pipeline.json",
        job_id=f"session-custom-pipeline-{ts}",
        enable_caching=True,
    )
    pipeline_job.submit()

Errors and notes

  1. When specifying serving_container_environment_variables and serving_container_ports the step fails with the following error:
{'code': 400, 'message': 'Invalid JSON payload received. Unknown name "MODEL_BUCKET" at \'model.container_spec.env[0]\': Cannot find field.\nInvalid value at \'model.container_spec.ports[0]\' (type.googleapis.com/google.cloud.aiplatform.v1.Port), 5000', 'status': 'INVALID_ARGUMENT', 'details': [{'@type': 'type.googleapis.com/google.rpc.BadRequest', 'fieldViolations': [{'field': 'model.container_spec.env[0]', 'description': 'Invalid JSON payload received. Unknown name "MODEL_BUCKET" at \'model.container_spec.env[0]\': Cannot find field.'}, {'field': 'model.container_spec.ports[0]', 'description': "Invalid value at 'model.container_spec.ports[0]' (type.googleapis.com/google.cloud.aiplatform.v1.Port), 5000"}]}]}
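The error message hints at the shape the Vertex AI v1 API actually expects: `model.container_spec.env` must be a list of `EnvVar` messages and `model.container_spec.ports` a list of `Port` messages, not a flat dict and a list of bare integers. A minimal sketch of the payload structure the API would accept (the field values here are illustrative; only the structure matters):

```python
# Shape of Model.containerSpec in the Vertex AI v1 REST API.
# env is a repeated EnvVar message, not a flat {"KEY": "value"} dict;
# ports is a repeated Port message, not a list of bare integers.
container_spec = {
    "imageUri": "gcr.io/my-project/pred:latest",
    "predictRoute": "/predict",
    "healthRoute": "/health",
    "env": [{"name": "MODEL_BUCKET", "value": "ml_model_bucket/model"}],
    "ports": [{"containerPort": 5000}],
}
```

Passing `{"MODEL_BUCKET": "ml_model_bucket/model"}` directly makes the backend try to interpret `MODEL_BUCKET` as a field of the `EnvVar` message, which produces the `Cannot find field` error above.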

When commenting out serving_container_environment_variables and serving_container_ports, the model resource gets created, but deploying it manually to the endpoint results in a failed deployment with no output logs.


Solution

  • After researching the problem for a while, I stumbled upon this GitHub issue. The problem originated from a mismatch between the google_cloud_pipeline_components and the Kubernetes API docs. In this case, serving_container_environment_variables is typed as Optional[dict[str, str]] whereas it should have been typed as Optional[list[dict[str, str]]]. There is a similar mismatch for the serving_container_ports argument. Passing the arguments following the Kubernetes documentation did the trick:

    model_upload_op = gcc_aip.ModelUploadOp(
        project="my-project",
        location="us-west1",
        display_name="session_model",
        serving_container_image_uri="gcr.io/my-project/pred:latest",
        serving_container_environment_variables=[
            {"name": "MODEL_BUCKET", "value": "ml_session_model"}
        ],
        serving_container_ports=[{"containerPort": 5000}],
        serving_container_predict_route="/predict",
        serving_container_health_route="/health",
    )
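If several pipelines share the same serving configuration, a small helper (hypothetical, not part of the component library) can convert the dict/int style used elsewhere in the Vertex SDK into the Kubernetes-style list-of-dicts shape that ModelUploadOp actually accepts:

```python
def to_env_list(env_vars: dict) -> list:
    """Convert {"KEY": "value"} into the Kubernetes-style EnvVar list."""
    return [{"name": name, "value": value} for name, value in env_vars.items()]


def to_port_list(ports: list) -> list:
    """Convert plain port numbers into the Kubernetes-style Port list."""
    return [{"containerPort": port} for port in ports]
```

For example, `to_env_list({"MODEL_BUCKET": "ml_session_model"})` returns `[{"name": "MODEL_BUCKET", "value": "ml_session_model"}]`, which can be passed straight to serving_container_environment_variables.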