python, azure-ml-pipelines

Invalid data binding expression when running AzureML pipeline


I'm running an AzureML pipeline using the command line where the sole job (for now) is a sweep.

When I run run_id=$(az ml job create -f path_to_pipeline/pipeline.yaml --query name -o tsv -g grp_name -w ws-name), I get the following error:

ERROR: Met error <class 'Exception'>:{
  "result": "Failed",
  "errors": [
    {
      "message": "Invalid data binding expression: inputs.data, outputs.model_output, search_space.batch_size, search_space.learning_rate",
      "path": "command",
      "value": "python train.py --data_path ${{inputs.data}} --output_path ${{outputs.model_output}} --batch_size ${{search_space.batch_size}} --learning_rate ${{search_space.learning_rate}}"
    }
  ]
}

The pipeline yaml looks like this:

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: pipeline_with_hyperparameter_sweep
description: Tune hyperparameters
settings:
  default_compute: azureml:compute-name  # sub with your compute name
jobs:
  sweep_step:
    type: sweep
    inputs:
      data:
        type: uri_file
        path: azureml:code_train_data:1  # data asset I created
    outputs:
      model_output:
    sampling_algorithm: random
    search_space:
      batch_size:
        type: choice
        values: [1, 5, 10, 15]
      learning_rate:
        type: loguniform
        min_value: -6.90775527898 # ln(0.001)
        max_value: -2.30258509299 # ln(0.1)
    trial:
      code: ../src
      command: >-
        python train.py 
        --data_path ${{inputs.data}} 
        --output_path ${{outputs.model_output}} 
        --batch_size ${{search_space.batch_size}} 
        --learning_rate ${{search_space.learning_rate}}
      environment: azureml:env_finetune_component:1
    objective:
      goal: maximize
      primary_metric: bleu_score
    limits:
      max_total_trials: 5
      max_concurrent_trials: 3
      timeout: 3600
      trial_timeout: 720

For the train.py file, note that I of course have a lot of actual code in the main function, but I stubbed it out with pass to check whether that makes a difference, and the error is the same. So the problem is upstream in the bindings, not in what train.py does.

import argparse

def main(args):
    pass

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_arguments("--data_path")
    parser.add_arguments("--output_path")
    parser.add_arguments("--batch_size", type=int)
    parser.add_arguments("--learning_rate", type=float)
    args = parser.parse_args()

    return args


if __name__ == "__main__":

    args = parse_args()

    main(args)

If helpful, here's the output when I run az version:

{
  "azure-cli": "2.53.0",
  "azure-cli-core": "2.53.0",
  "azure-cli-telemetry": "1.1.0",
  "extensions": {
    "ml": "2.20.0"
  }
}

Solution

  • I found the solution. The pipeline.yaml syntax for trial is in fact just trial: filename.yaml, i.e. it points to a separate command-component file rather than an inline trial block:

    $schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
    type: pipeline
    display_name: codegen_sweep
    description: Tune hyperparameters
    settings:
      default_compute: azureml:roma2
    jobs:
      sweep_step:
        type: sweep
        inputs:
          data_path:
            type: uri_file
            path: azureml:code_train_data:1
        outputs:
          model_output:
        sampling_algorithm: random
        search_space:
          batch_size:
            type: choice
            values: [1, 5, 10, 15]
          learning_rate:
            type: loguniform
            min_value: -6.90775527898 # ln(0.001)
            max_value: -2.30258509299 # ln(0.1)
        trial: ./train.yaml
        objective:
          goal: maximize
          primary_metric: eval_bleu_score # the metric name as mlflow logs it in my other models
        limits:
          max_total_trials: 5
          max_concurrent_trials: 3
          timeout: 3600 # 1 hour
          trial_timeout: 720 # 12 mins
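
    For comparison only (this is not part of the fix itself), the same structure can be sketched with the Python SDK v2 (azure-ai-ml): the trial is defined as its own command object and only then turned into a sweep, which mirrors keeping the trial in a separate train.yaml. This is a minimal sketch; the compute, environment, and data-asset names are carried over from the YAML in this post and may need adjusting for your workspace.

    from azure.ai.ml import MLClient, Input, Output, command
    from azure.ai.ml.sweep import Choice, LogUniform
    from azure.identity import DefaultAzureCredential

    ml_client = MLClient(
        DefaultAzureCredential(),
        subscription_id="<subscription-id>",   # placeholder
        resource_group_name="grp_name",
        workspace_name="ws-name",
    )

    # The trial is its own command job, analogous to train.yaml.
    trial = command(
        code="../src",
        command=(
            "python train.py "
            "--data_path ${{inputs.data_path}} "
            "--output_path ${{outputs.model_output}} "
            "--batch_size ${{inputs.batch_size}} "
            "--learning_rate ${{inputs.learning_rate}}"
        ),
        environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
        inputs={
            "data_path": Input(type="uri_file", path="azureml:code_train_data:1"),
            "batch_size": 5,        # defaults, overridden by the sweep below
            "learning_rate": 0.01,
        },
        outputs={"model_output": Output(type="mlflow_model")},
        compute="roma2",
    )

    # Re-call the command with search-space distributions, then turn it into a sweep.
    trial_for_sweep = trial(
        batch_size=Choice(values=[1, 5, 10, 15]),
        learning_rate=LogUniform(min_value=-6.90775527898, max_value=-2.30258509299),
    )
    sweep_job = trial_for_sweep.sweep(
        sampling_algorithm="random",
        primary_metric="eval_bleu_score",
        goal="Maximize",
    )
    sweep_job.set_limits(
        max_total_trials=5, max_concurrent_trials=3, timeout=3600, trial_timeout=720
    )

    returned_job = ml_client.jobs.create_or_update(sweep_job)
    print(returned_job.name)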
    

    There was another problem. In the train.yaml file, my source directory sits parallel to the YAML files, so I needed to reference it as ../src:

    $schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
    type: command
    
    name: train_model
    display_name: train_model
    version: 1
    
    inputs:
      data_path:
        type: uri_file
      batch_size:
        type: integer
      learning_rate:
        type: number
    
    outputs:
      model_output:
        type: mlflow_model
    
    code: ../src
    
    environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
    
    command: >-
      python train.py
        --data_path ${{inputs.data_path}}
        --output_path ${{outputs.model_output}}
        --batch_size ${{inputs.batch_size}}
        --learning_rate ${{inputs.learning_rate}}
    

    Note that I simplified the arguments just to focus on getting this to work. Additionally, I fixed parser.add_arguments to parser.add_argument, as pointed out in one of the comments.
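
    For completeness, the corrected argument parsing looks like this (argparse's method is add_argument, singular; calling add_arguments raises AttributeError). Everything else in train.py is unchanged:

    import argparse

    def parse_args():
        parser = argparse.ArgumentParser()
        # add_argument, not add_arguments
        parser.add_argument("--data_path")
        parser.add_argument("--output_path")
        parser.add_argument("--batch_size", type=int)
        parser.add_argument("--learning_rate", type=float)
        return parser.parse_args()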