I am currently trying to deploy a Vertex pipeline to achieve the following:
Train a custom model (from a custom training Python package) and dump the model artifacts (the trained model and the data preprocessor that will be used at prediction time). This step is working fine, as I can see new resources being created in the storage bucket.
Create a model resource via ModelUploadOp. This step fails when specifying serving_container_environment_variables and serving_container_ports, with the error message in the errors section below. This is somewhat surprising, as both are needed by the prediction container, and the environment variables are passed as a dict as specified in the documentation.
This step works just fine using gcloud commands:
gcloud ai models upload \
--region us-west1 \
--display-name session_model_latest \
--container-image-uri gcr.io/and-reporting/pred:latest \
--container-env-vars="MODEL_BUCKET=ml_session_model" \
--container-health-route=//health \
--container-predict-route=//predict \
--container-ports=5000
There is clearly something that I am getting wrong with Vertex, and the components documentation doesn't help much in this case.
from datetime import datetime

import kfp
from google.cloud import aiplatform
from google_cloud_pipeline_components import aiplatform as gcc_aip
from kfp.v2 import compiler

PIPELINE_ROOT = "gs://ml_model_bucket/pipeline_root"


@kfp.dsl.pipeline(name="session-train-deploy", pipeline_root=PIPELINE_ROOT)
def pipeline():
    training_op = gcc_aip.CustomPythonPackageTrainingJobRunOp(
        project="my-project",
        location="us-west1",
        display_name="train_session_model",
        model_display_name="session_model",
        service_account="name@my-project.iam.gserviceaccount.com",
        environment_variables={"MODEL_BUCKET": "ml_session_model"},
        python_module_name="trainer.train",
        staging_bucket="gs://ml_model_bucket/",
        base_output_dir="gs://ml_model_bucket/",
        args=[
            "--gcs-data-path",
            "gs://ml_model_data/2019-Oct_short.csv",
            "--gcs-model-path",
            "gs://ml_model_bucket/model/model.joblib",
            "--gcs-preproc-path",
            "gs://ml_model_bucket/model/preproc.pkl",
        ],
        container_uri="us-docker.pkg.dev/vertex-ai/training/scikit-learn-cpu.0-23:latest",
        python_package_gcs_uri="gs://ml_model_bucket/trainer-0.0.1.tar.gz",
        model_serving_container_image_uri="gcr.io/my-project/pred",
        model_serving_container_predict_route="/predict",
        model_serving_container_health_route="/health",
        model_serving_container_ports=[5000],
        model_serving_container_environment_variables={
            "MODEL_BUCKET": "ml_model_bucket/model"
        },
    )

    model_upload_op = gcc_aip.ModelUploadOp(
        project="and-reporting",
        location="us-west1",
        display_name="session_model",
        serving_container_image_uri="gcr.io/my-project/pred:latest",
        # When passing the following 2 arguments this step fails...
        serving_container_environment_variables={"MODEL_BUCKET": "ml_model_bucket/model"},
        serving_container_ports=[5000],
        serving_container_predict_route="/predict",
        serving_container_health_route="/health",
    )
    model_upload_op.after(training_op)

    endpoint_create_op = gcc_aip.EndpointCreateOp(
        project="my-project",
        location="us-west1",
        display_name="pipeline_endpoint",
    )

    model_deploy_op = gcc_aip.ModelDeployOp(
        model=model_upload_op.outputs["model"],
        endpoint=endpoint_create_op.outputs["endpoint"],
        deployed_model_display_name="session_model",
        traffic_split={"0": 100},
        service_account="name@my-project.iam.gserviceaccount.com",
    )
    model_deploy_op.after(endpoint_create_op)


if __name__ == "__main__":
    ts = datetime.now().strftime("%Y%m%d%H%M%S")
    compiler.Compiler().compile(pipeline, "custom_train_pipeline.json")
    pipeline_job = aiplatform.PipelineJob(
        display_name="session_train_and_deploy",
        template_path="custom_train_pipeline.json",
        job_id=f"session-custom-pipeline-{ts}",
        enable_caching=True,
    )
    pipeline_job.submit()
When specifying serving_container_environment_variables and serving_container_ports, the step fails with the following error:
{'code': 400, 'message': 'Invalid JSON payload received. Unknown name "MODEL_BUCKET" at \'model.container_spec.env[0]\': Cannot find field.\nInvalid value at \'model.container_spec.ports[0]\' (type.googleapis.com/google.cloud.aiplatform.v1.Port), 5000', 'status': 'INVALID_ARGUMENT', 'details': [{'@type': 'type.googleapis.com/google.rpc.BadRequest', 'fieldViolations': [{'field': 'model.container_spec.env[0]', 'description': 'Invalid JSON payload received. Unknown name "MODEL_BUCKET" at \'model.container_spec.env[0]\': Cannot find field.'}, {'field': 'model.container_spec.ports[0]', 'description': "Invalid value at 'model.container_spec.ports[0]' (type.googleapis.com/google.cloud.aiplatform.v1.Port), 5000"}]}]}
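A response like this is easier to read when reduced to its field violations. The snippet below is only a rough sketch: `field_violations` is a hypothetical helper (not part of any Google library), and the `error` dict is a shortened copy of the message above with the descriptions abbreviated.

```python
from __future__ import annotations


def field_violations(error: dict) -> list[tuple[str, str]]:
    """Collect (field, description) pairs from BadRequest details in an error dict."""
    pairs = []
    for detail in error.get("details", []):
        # Only BadRequest details carry a "fieldViolations" list.
        if detail.get("@type", "").endswith("google.rpc.BadRequest"):
            for violation in detail.get("fieldViolations", []):
                pairs.append((violation["field"], violation["description"]))
    return pairs


# Shortened copy of the error above; descriptions are abbreviated by hand.
error = {
    "code": 400,
    "status": "INVALID_ARGUMENT",
    "details": [
        {
            "@type": "type.googleapis.com/google.rpc.BadRequest",
            "fieldViolations": [
                {
                    "field": "model.container_spec.env[0]",
                    "description": 'Unknown name "MODEL_BUCKET": Cannot find field.',
                },
                {
                    "field": "model.container_spec.ports[0]",
                    "description": "Invalid value (google.cloud.aiplatform.v1.Port), 5000",
                },
            ],
        }
    ],
}

for field, description in field_violations(error):
    print(f"{field}: {description}")
```

Both violations point at the first element of a list (env[0] and ports[0]), which suggests the API rejects the shape of each element rather than the values themselves.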
When commenting out serving_container_environment_variables and serving_container_ports, the model resource gets created, but deploying it manually to the endpoint results in a failed deployment with no output logs.
After some time researching the problem, I stumbled upon this GitHub issue. The problem originated from a mismatch between the google_cloud_pipeline_components and kubernetes_api docs. In this case, serving_container_environment_variables is typed as Optional[dict[str, str]], whereas it should have been typed as Optional[list[dict[str, str]]]. A similar mismatch can be found for the serving_container_ports argument as well. Passing the arguments following the Kubernetes documentation did the trick:
model_upload_op = gcc_aip.ModelUploadOp(
    project="my-project",
    location="us-west1",
    display_name="session_model",
    serving_container_image_uri="gcr.io/my-project/pred:latest",
    serving_container_environment_variables=[
        {"name": "MODEL_BUCKET", "value": "ml_session_model"}
    ],
    serving_container_ports=[{"containerPort": 5000}],
    serving_container_predict_route="/predict",
    serving_container_health_route="/health",
)
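If the settings already exist as a plain dict and a list of integers (the shapes used in the question), they can be converted rather than rewritten by hand. This is a minimal sketch: `to_env_list` and `to_port_list` are hypothetical helper names, and it assumes the installed version of google_cloud_pipeline_components expects the Kubernetes-style shapes shown above, which may not hold for other releases.

```python
from __future__ import annotations


def to_env_list(env: dict[str, str]) -> list[dict[str, str]]:
    """Turn {"NAME": "value"} into [{"name": "NAME", "value": "value"}]."""
    return [{"name": name, "value": str(value)} for name, value in env.items()]


def to_port_list(ports: list[int]) -> list[dict[str, int]]:
    """Turn [5000] into [{"containerPort": 5000}]."""
    return [{"containerPort": int(port)} for port in ports]


env_vars = to_env_list({"MODEL_BUCKET": "ml_session_model"})
ports = to_port_list([5000])

print(env_vars)
print(ports)
```

The resulting `env_vars` and `ports` values can then be passed as serving_container_environment_variables and serving_container_ports in the call above.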