Tags: google-ai-platform, kubeflow-pipelines, google-cloud-ai-platform-pipelines, google-cloud-vertex-ai

Autoscaling Vertex AI pipeline components


I am exploring Vertex AI Pipelines and understand that it is a managed alternative to, say, AI Platform Pipelines (where you have to deploy a GKE cluster to run Kubeflow pipelines). What I am not clear on is whether Vertex AI will autoscale the cluster depending on the load. The answer to a similar question mentions that pipeline steps using GCP resources such as Dataflow will be autoscaled automatically. The Google docs mention that one can set resources for components, such as CPU_LIMIT, GPU_LIMIT, etc. My questions are:

1. Can these limits be set for any type of component, i.e., Google Cloud pipeline components as well as custom components, whether Python function-based or packaged as a container image?
2. Do these limits mean that the component's resources will autoscale until they hit those limits?
3. What happens if these options are not specified at all? How are the resources allocated then, and will they autoscale as Vertex AI sees fit?

Links to relevant docs and resources would be really helpful.


Solution

  • To answer your questions:

    1. Can these limits be set for any type of components?

    Yes. These limits apply to all Kubeflow components and are not specific to any particular type of component. Components can be implemented to perform their tasks within a set amount of resources.
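
    As a minimal sketch (assuming the KFP v2 SDK, with a made-up `train` component), resource limits can be set on any task inside the pipeline definition:

    ```python
    from kfp import dsl

    @dsl.component
    def train(epochs: int) -> str:
        # Placeholder training logic for illustration only.
        return f"trained for {epochs} epochs"

    @dsl.pipeline(name="resource-limits-demo")
    def pipeline():
        task = train(epochs=10)
        # Request a VM shape for this step; Vertex AI picks one
        # machine satisfying these limits -- it does not autoscale.
        task.set_cpu_limit("8")
        task.set_memory_limit("32G")
        task.set_accelerator_type("NVIDIA_TESLA_T4")
        task.set_accelerator_limit(1)
    ```

    The same `set_*` calls work on tasks created from container-based components, since they operate on the pipeline task rather than on the component implementation.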


    2. Do these limits mean that the component resources will autoscale till they hit the limits?

    No, Vertex AI does not perform autoscaling. Based on the limits set, Vertex AI chooses one suitable VM to perform the task. A pool of workers is supported in Google Cloud Pipeline Components such as "CustomContainerTrainingJobRunOp" and "CustomPythonPackageTrainingJobRunOp" as part of distributed training in Vertex AI. Otherwise, only one machine is used per step.
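
    A hedged sketch of such a worker pool (the project, bucket, and image values are placeholders, and the import path may differ across google-cloud-pipeline-components versions):

    ```python
    from google_cloud_pipeline_components.aiplatform import (
        CustomContainerTrainingJobRunOp,
    )
    from kfp import dsl

    @dsl.pipeline(name="distributed-training-demo")
    def pipeline():
        CustomContainerTrainingJobRunOp(
            display_name="distributed-train",
            container_uri="gcr.io/my-project/trainer:latest",  # placeholder
            project="my-project",                               # placeholder
            location="us-central1",
            staging_bucket="gs://my-bucket",                    # placeholder
            replica_count=4,  # fixed pool of 4 workers, not autoscaled
            machine_type="n1-standard-8",
            accelerator_type="NVIDIA_TESLA_T4",
            accelerator_count=1,
        )
    ```

    Note that `replica_count` fixes the size of the worker pool up front; it is a static allocation, not autoscaling.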


    3. What happens if these limits are not specified? Does Vertex AI scale the resources as it sees fit?

    If the limits are not specified, an "e2-standard-4" VM is used for task execution by default. There is no autoscaling in this case either.


    EDIT: I have updated the links with the latest version of the documentation.