kubeflow-pipelines

How can you specify local path of InputPath or OutputPath in Kubeflow Pipelines


I've started using Kubeflow Pipelines to run data processing, training and predicting for a machine learning project, and I'm using InputPath and OutputhPath to pass large files between components.

I'd like to know how, if it's possible, do I set the path that OutputPath would look for a file in in a component, and where InputPath would load a file in a component.

Currently, the code stores them in a pre-determined place (e.g. data/my_data.csv), and it would be ideal if I could 'tell' InputPath/OutputPath this is the file it should copy, instead of having to rename all the files to match what OutputPath expects, as per below minimal example.

@dsl.pipelines(name='test_pipeline')
def pipeline():
    pp = create_component_from_func(func=_pre_process_data)()
    # use pp['pre_processed']...

def pre_process_data(pre_processed_path: OutputPath('csv')):
    import os

    print('do some processing which saves file to data/pre_processed.csv')

    # want to avoid this:
    print('move files to OutputPath locations...')
    os.rename(f'data/pre_processed.csv', pre_processed_path)

Naturally I would prefer not to update the code to adhere to Kubeflow pipeline naming convention, as that seems like very bad practice to me.

Thanks!


Solution

  • Update - See ark-kun's comment, the approach in my original answer is deprecated and should not be used. It is better to let Kubeflow Pipelines specify where you should store your pipeline's artifacts.


    For lightweight components (such as the one in your example), Kubeflow Pipelines builds the container image for your component and specifies the paths for inputs and outputs (based upon the types you use to decorate your component function). I would recommend using those paths directly, instead of writing to one location and then renaming the file. The Kubeflow Pipelines samples follow this pattern.

    For reusable components, you define the pipeline inputs and outputs as part of the YAML specification for the component. In that case you can specify your preferred location for the output files. That being said, reusable components take a bit more effort to create, since you need to build a Docker container image and component specification in YAML.