I know that Kubeflow only modifies the container by installing the specified libraries. But I want to use my custom module in the training component of the pipeline.
To clarify my case: I'm deploying a GCP Vertex AI pipeline that consists of preprocessing and training steps. There is also a custom library that I created using libraries such as scikit-learn. My main issue is that I want to reuse that library's objects within my training step, which looks like this:
```
from kfp.v2.dsl import Dataset, Input, Model, Output, component

@component(
    packages_to_install=[
        "pandas",
        "scikit-learn",
        "mycustomlibrary?",  # <- this is the part I don't know how to do
    ],
)
def train_xgb_model(
    dataset: Input[Dataset],
    model_artifact: Output[Model],
):
    from MyCustomLibrary import XGBClassifier
    import pandas as pd

    data = pd.read_csv(dataset.path)
    model = XGBClassifier(objective="binary:logistic")
    model.fit(
        data.drop(columns=["target"]),
        data.target,
    )
    score = model.score(
        data.drop(columns=["target"]),
        data.target,
    )
    model_artifact.metadata["train_score"] = float(score)
    model_artifact.metadata["framework"] = "XGBoost"
    model.save_model(model_artifact.path)
```
One option is to bake your custom module into a custom container image. Then you can use your custom image for the component as:
```
@component(
    base_image='gcr.io/my-custom-image',
    packages_to_install=[
        "pandas",
        "scikit-learn",
    ],
)
def train_xgb_model(...):
    ...
```
In fact, if you go this route, you might want to bake `pandas` and `scikit-learn` into your custom container as well.
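A minimal Dockerfile for such an image could look like the sketch below; the base image, paths, and the way `mycustomlibrary` is copied in are all assumptions you would adapt to your project:
```
# Sketch of a custom image that bundles the dependencies and your library.
# Base image, paths, and names are assumptions; adapt them to your project.
FROM python:3.9-slim

# Bake the common dependencies into the image.
RUN pip install --no-cache-dir pandas scikit-learn

# Copy your library's source into the image and install it.
COPY mycustomlibrary/ /tmp/mycustomlibrary/
RUN pip install --no-cache-dir /tmp/mycustomlibrary/
```
You would then build and push the image to a registry your pipeline can pull from, e.g. `docker build -t gcr.io/my-project/my-custom-image .` followed by `docker push gcr.io/my-project/my-custom-image` (hypothetical names), and reference that tag in `base_image`.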
An alternative is to host your `mycustomlibrary` somewhere on the internet; it can be a GitHub repo, for instance. Then you can install it as follows:
```
@component(
    packages_to_install=[
        "pandas",
        "scikit-learn",
        "git+https://my-repo/mycustomlibrary.git",
    ],
)
def train_xgb_model(...):
    ...
```
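For this to work, the repo itself has to be pip-installable, i.e. contain a `setup.py` or `pyproject.toml`. A minimal `setup.py` sketch, with hypothetical name, version, and dependencies, might be:
```
# Minimal setup.py so pip can install the repo from its Git URL.
# Name, version, and dependencies are hypothetical placeholders.
from setuptools import find_packages, setup

setup(
    name="mycustomlibrary",
    version="0.1.0",
    packages=find_packages(),
    install_requires=["scikit-learn"],
)
```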
Note that whatever you specify in `packages_to_install` is passed to the `pip install` command, and `pip` allows installing from various sources. For example: https://packaging.python.org/en/latest/tutorials/installing-packages/#installing-from-vcs
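Because `pip` understands Git URLs with refs, you can also pin the install to a specific tag or commit so pipeline runs stay reproducible; `v0.1.0` below is a hypothetical tag in your repo:
```
packages_to_install=[
    "pandas",
    "scikit-learn",
    # Pin to a hypothetical tag (or a commit SHA) for reproducible runs:
    "git+https://my-repo/mycustomlibrary.git@v0.1.0",
],
```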