I am trying to pass data from one component to the next in an Azure ML pipeline. I am able to do it in a simple script.
I have two components, which I define as below:
components_dir = "."
prep = load_component(source=f"{components_dir}/preprocessing_config.yml")
middle = load_component(source=f"{components_dir}/middle_config.yml")
Then I am defining a pipeline as below:
@pipeline(
    display_name="test_pipeline3",
    tags={"authoring": "sdk"},
    description="test pipeline to test things just like all other test pipelines.",
)
def data_pipeline(
    # raw_data: Input,
    compute_train_node: str,
):
    prep_node = prep()
    prep_node.outputs.Y_df = Output(type="uri_folder", mode="rw_mount", path="path/testing/")
    prep_node.outputs.S_df = Output(type="uri_folder", mode="rw_mount", path="path/testing/")
    transform_node = middle(
        Y_df=prep_node.outputs.Y_df,
        S_df=prep_node.outputs.S_df,
    )
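The middle component's script is not shown in the question; for context, a minimal sketch of what it might look like, assuming its command line maps `--Y_df`/`--S_df` to the upstream `uri_folder` outputs (`load_folder_csv` and the file name `Y_df.csv` are illustrative, not from the source):

```python
import argparse
import csv
from pathlib import Path


def load_folder_csv(folder: str, filename: str) -> list:
    """Read one CSV out of a mounted uri_folder into a list of row dicts."""
    with open(Path(folder) / filename, newline="") as f:
        return list(csv.DictReader(f))


def main() -> None:
    # Azure ML substitutes the mounted folder paths for ${{inputs.Y_df}} /
    # ${{inputs.S_df}} before the script starts, so they arrive as plain paths.
    parser = argparse.ArgumentParser("middle")
    parser.add_argument("--Y_df", type=str, help="Folder written by the prep node")
    parser.add_argument("--S_df", type=str, help="Folder written by the prep node")
    args = parser.parse_args()

    y_rows = load_folder_csv(args.Y_df, "Y_df.csv")
    s_rows = load_folder_csv(args.S_df, "S_df.csv")
    print(f"loaded {len(y_rows)} Y rows and {len(s_rows)} S rows")

# The component's command would invoke this as, e.g.:
#   python middle_script.py --Y_df ${{inputs.Y_df}} --S_df ${{inputs.S_df}}
```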
The prep node runs a script that uses Hydra to read its parameters from a config file. The component's config file kicks off the script on the command line as below:
python preprocessing_script.py
--Y_df ${{outputs.Y_df}}
--S_df ${{outputs.S_df}}
I try to get the values of Y_df.path and S_df.path in the main function of the prep script as below:
@hydra.main(version_base=None, config_path=".", config_name="config_file")
def main(cfg: DictConfig):
    parser = argparse.ArgumentParser("prep")
    parser.add_argument("--Y_df", type=str, help="Path of prepped data")
    parser.add_argument("--S_df", type=str, help="Path of prepped data")
    args = parser.parse_args()

    # Call the preprocessing function with Hydra configurations
    df1, df2 = processing_func(cfg.data_name, cfg.prod_filter)
    df1.to_csv(Path(cfg.Y_df) / "Y_df.csv")
    df2.to_csv(Path(cfg.S_df) / "S_df.csv")
When I run all of this, I get an error in the prep component itself saying
Execution failed. User process 'python' exited with status code 2. Please check log file 'user_logs/std_log.txt' for error details. Error: /bin/bash: /azureml-envs/azureml_bbh34278yrnrfuehn78340/lib/libtinfo.so.6: no version information available (required by /bin/bash)
usage: data_processing.py [--help] [--hydra-help] [--version]
[--cfg {job,hydra,all}] [--resolve]
[--package PACKAGE] [--run] [--multirun]
[--shell-completion] [--config-path CONFIG_PATH]
[--config-name CONFIG_NAME]
[--config-dir CONFIG_DIR]
[--experimental-rerun EXPERIMENTAL_RERUN]
[--info [{all,config,defaults,defaults-tree,plugins,searchpath}]]
[overrides ...]
data_processing.py: error: unrecognized arguments: --Y_df --S_df /mnt/azureml/cr/j/ffyh7fs984ryn8f733ff3/cap/data-capability/wd/S_df
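The "status code 2" in that message is argparse's standard exit code for an argument error: @hydra.main consumes sys.argv itself, so Hydra's own command-line parser sees --Y_df/--S_df and rejects them before the script's argparse code ever runs. A standalone illustration of that failure mode, using plain argparse (no Hydra; the paths are placeholders):

```python
import argparse

# Stand-in for Hydra's command-line parser: it accepts only a positional
# list of 'key=value' overrides, not --flag style arguments.
parser = argparse.ArgumentParser("data_processing.py")
parser.add_argument("overrides", nargs="*", help="key=value overrides")

try:
    # Simulate launching with the component's argparse-style flags.
    parser.parse_args(["--Y_df", "/tmp/Y_df", "--S_df", "/tmp/S_df"])
except SystemExit as exc:
    # argparse prints "unrecognized arguments: ..." and exits with status 2
    print("exit status:", exc.code)
```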
The code runs fine and data is transferred between the components when Hydra is not involved, but when Hydra is involved I get this error. Why is that?
Edit: Below is the data component config file for prep:
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
type: command
name: preprocessing24
display_name: preprocessing24
outputs:
  Y_df:
    type: uri_folder
  S_df:
    type: uri_folder
code: ./preprocessing_final
environment: azureml:datapipeline-environment:4
command: >-
  python data_processing.py
The data preprocessing (Hydra) config file just contains a bunch of variables, but I have added two more:
Y_df: random_txt
S_df: random_txt
The main function of the data processing script is shown above.
OK, here is what was happening.
This notation in the CLI command did not work:
python preprocessing_script.py
--Y_df ${{outputs.Y_df}}
--S_df ${{outputs.S_df}}
That's because Hydra does not accept that flag notation (I think).
Instead this notation worked:
python data_processing.py '+Y_df=${{outputs.Y_df}}' '+S_df=${{outputs.S_df}}'
What this does is add those two new variables, Y_df and S_df, into the config.
These variables can then be accessed in the program just like all the other config variables, via cfg.Y_df or cfg.S_df.
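For anyone puzzled by the + prefix: Hydra's override grammar distinguishes overriding an existing config key (key=value) from appending a key the config file doesn't have (+key=value). The sketch below is not Hydra itself, just a toy dict-based model of that rule (apply_overrides and the sample keys are made up for illustration):

```python
def apply_overrides(cfg: dict, overrides: list) -> dict:
    """Toy model of Hydra's override grammar: 'key=value' replaces an
    existing key, while '+key=value' appends a key absent from the config."""
    cfg = dict(cfg)
    for ov in overrides:
        append = ov.startswith("+")
        key, _, value = ov.lstrip("+").partition("=")
        if append and key in cfg:
            raise KeyError(f"'+{key}' appends, but '{key}' already exists")
        if not append and key not in cfg:
            raise KeyError(f"'{key}' not in config; use '+{key}=' to append it")
        cfg[key] = value
    return cfg


base = {"data_name": "sales", "prod_filter": "all"}
cfg = apply_overrides(base, ["+Y_df=/mnt/outputs/Y_df", "+S_df=/mnt/outputs/S_df"])
print(cfg["Y_df"])  # /mnt/outputs/Y_df
```

This is why the original command failed: --Y_df is neither form of override, and why '+Y_df=...' succeeded even though Y_df was not in the config file.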