I'm running an AzureML pipeline using the command line where the sole job (for now) is a sweep.
When I run

run_id=$(az ml job create -f path_to_pipeline/pipeline.yaml --query name -o tsv -g grp_name -w ws-name)

I get the following error:
ERROR: Met error <class 'Exception'>:{
  "result": "Failed",
  "errors": [
    {
      "message": "Invalid data binding expression: inputs.data, outputs.model_output, search_space.batch_size, search_space.learning_rate",
      "path": "command",
      "value": "python train.py --data_path ${{inputs.data}} --output_path ${{outputs.model_output}} --batch_size ${{search_space.batch_size}} --learning_rate ${{search_space.learning_rate}}"
    }
  ]
}
The pipeline yaml looks like this:
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: pipeline_with_hyperparameter_sweep
description: Tune hyperparameters
settings:
  default_compute: azureml:compute-name # sub with your compute name
jobs:
  sweep_step:
    type: sweep
    inputs:
      data:
        type: uri_file
        path: azureml:code_train_data:1 # data asset I created
    outputs:
      model_output:
    sampling_algorithm: random
    search_space:
      batch_size:
        type: choice
        values: [1, 5, 10, 15]
      learning_rate:
        type: loguniform
        min_value: -6.90775527898 # ln(0.001)
        max_value: -2.30258509299 # ln(0.1)
    trial:
      code: ../src
      command: >-
        python train.py
        --data_path ${{inputs.data}}
        --output_path ${{outputs.model_output}}
        --batch_size ${{search_space.batch_size}}
        --learning_rate ${{search_space.learning_rate}}
      environment: azureml:env_finetune_component:1
    objective:
      goal: maximize
      primary_metric: bleu_score
    limits:
      max_total_trials: 5
      max_concurrent_trials: 3
      timeout: 3600
      trial_timeout: 720
For the train.py file: I of course have a lot of actual code in the main function, but I commented it out and replaced it with pass to check whether it makes a difference, and the error is the same. So the problem is upstream in the bindings, not in what's inside train.py:
import argparse

def main(args):
    pass

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_arguments("--data_path")
    parser.add_arguments("--output_path")
    parser.add_arguments("--batch_size", type=int)
    parser.add_arguments("--learning_rate", type=float)
    args = parser.parse_args()
    return args

if __name__ == "__main__":
    args = parse_args()
    main(args)
If helpful, here's the output of az version:
{
  "azure-cli": "2.53.0",
  "azure-cli-core": "2.53.0",
  "azure-cli-telemetry": "1.1.0",
  "extensions": {
    "ml": "2.20.0"
  }
}
I found the solution. The pipeline.yaml syntax for trial is in fact just a path to a separate component file:

trial: filename.yaml

So the corrected pipeline.yaml is:
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: codegen_sweep
description: Tune hyperparameters
settings:
  default_compute: azureml:roma2
jobs:
  sweep_step:
    type: sweep
    inputs:
      data_path:
        type: uri_file
        path: azureml:code_train_data:1
    outputs:
      model_output:
    sampling_algorithm: random
    search_space:
      batch_size:
        type: choice
        values: [1, 5, 10, 15]
      learning_rate:
        type: loguniform
        min_value: -6.90775527898 # ln(0.001)
        max_value: -2.30258509299 # ln(0.1)
    trial: ./train.yaml
    objective:
      goal: maximize
      primary_metric: eval_bleu_score # how mlflow logs it in other models
    limits:
      max_total_trials: 5
      max_concurrent_trials: 3
      timeout: 3600 # 1 hour
      trial_timeout: 720 # 12 mins
There was another problem: in the train.yaml file, my source directory sits parallel to the YAML files, so I needed to reference it as ../src:
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
type: command
name: train_model
display_name: train_model
version: 1
inputs:
  data_path:
    type: uri_file
  batch_size:
    type: integer
  learning_rate:
    type: number
outputs:
  model_output:
    type: mlflow_model
code: ../src
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
command: >-
  python train.py
  --data_path ${{inputs.data_path}}
  --output_path ${{outputs.model_output}}
  --batch_size ${{inputs.batch_size}}
  --learning_rate ${{inputs.learning_rate}}
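The original error ("Invalid data binding expression") came from ${{search_space.*}} and unresolvable ${{inputs.*}} names in the inline trial command. A small offline check can catch unbound expressions before submitting; this is just a local sketch using a regex, not an AzureML feature, and the declared names below are copied from the component above:

```python
import re

def unbound_bindings(command, inputs, outputs):
    """Return ${{scope.name}} expressions that don't match a declared name."""
    bad = []
    for scope, name in re.findall(r"\$\{\{(\w+)\.(\w+)\}\}", command):
        ok = (scope == "inputs" and name in inputs) or \
             (scope == "outputs" and name in outputs)
        if not ok:
            bad.append(f"{scope}.{name}")
    return bad

# Corrected command from train.yaml: every binding resolves.
good = ("python train.py --data_path ${{inputs.data_path}} "
        "--output_path ${{outputs.model_output}} "
        "--batch_size ${{inputs.batch_size}} "
        "--learning_rate ${{inputs.learning_rate}}")
print(unbound_bindings(good, {"data_path", "batch_size", "learning_rate"},
                       {"model_output"}))  # []

# Fragment of the original failing command: search_space is not a valid scope.
bad = "python train.py --batch_size ${{search_space.batch_size}}"
print(unbound_bindings(bad, set(), set()))  # ['search_space.batch_size']
```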
Note that I simplified the arguments just to focus on getting this to work. Additionally, I fixed the parser.add_arguments calls in train.py (the actual argparse method is parser.add_argument) as per one of the comments.
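For completeness, here is the corrected parser from train.py, with an optional argv parameter added (my addition, for easy local testing) on top of the add_argument fix:

```python
import argparse

def main(args):
    pass  # actual training code elided

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # add_argument, not add_arguments -- the latter raises AttributeError
    parser.add_argument("--data_path")
    parser.add_argument("--output_path")
    parser.add_argument("--batch_size", type=int)
    parser.add_argument("--learning_rate", type=float)
    return parser.parse_args(argv)

if __name__ == "__main__":
    main(parse_args())
```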