python, azure-ml-pipelines

Invalid data binding expression when running AzureML pipeline


I'm running an AzureML pipeline using the command line where the sole job (for now) is a sweep.

When I run run_id=$(az ml job create -f path_to_pipeline/pipeline.yaml --query name -o tsv -g grp_name -w ws-name), I get the following error:

ERROR: Met error <class 'Exception'>:{
  "result": "Failed",
  "errors": [
    {
      "message": "Invalid data binding expression: inputs.data, outputs.model_output, search_space.batch_size, search_space.learning_rate",
      "path": "command",
      "value": "python train.py --data_path ${{inputs.data}} --output_path ${{outputs.model_output}} --batch_size ${{search_space.batch_size}} --learning_rate ${{search_space.learning_rate}}"
    }
  ]
}

The pipeline yaml looks like this:

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: pipeline_with_hyperparameter_sweep
description: Tune hyperparameters
settings:
  default_compute: azureml:compute-name  # sub with your compute name
jobs:
  sweep_step:
    type: sweep
    inputs:
      data:
        type: uri_file
        path: azureml:code_train_data:1  # data asset I created
    outputs:
      model_output:
    sampling_algorithm: random
    search_space:
      batch_size:
        type: choice
        values: [1, 5, 10, 15]
      learning_rate:
        type: loguniform
        min_value: -6.90775527898 # ln(0.001)
        max_value: -2.30258509299 # ln(0.1)
    trial:
      code: ../src
      command: >-
        python train.py 
        --data_path ${{inputs.data}} 
        --output_path ${{outputs.model_output}} 
        --batch_size ${{search_space.batch_size}} 
        --learning_rate ${{search_space.learning_rate}}
      environment: azureml:env_finetune_component:1
    objective:
      goal: maximize
      primary_metric: bleu_score
    limits:
      max_total_trials: 5
      max_concurrent_trials: 3
      timeout: 3600
      trial_timeout: 720

For the train.py file, note that I of course have a lot of actual code in the main function, but I stubbed it out with pass to check whether that makes a difference, and the error is the same. So the problem is upstream in the bindings, not in what train.py does.

import argparse

def main(args):
    pass

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_arguments("--data_path")
    parser.add_arguments("--output_path")
    parser.add_arguments("--batch_size", type=int)
    parser.add_arguments("--learning_rate", type=float)
    args = parser.parse_args()

    return args


if __name__ == "__main__":

    args = parse_args()

    main(args)

If helpful, here's the output when I run az version:

{
  "azure-cli": "2.53.0",
  "azure-cli-core": "2.53.0",
  "azure-cli-telemetry": "1.1.0",
  "extensions": {
    "ml": "2.20.0"
  }
}

Solution

  • I found the solution. The pipeline.yaml syntax for trial is in fact just trial: filename.yaml, i.e. it points to a separate command-component file rather than an inline trial block:

    $schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
    type: pipeline
    display_name: codegen_sweep
    description: Tune hyperparameters
    settings:
      default_compute: azureml:roma2
    jobs:
      sweep_step:
        type: sweep
        inputs:
          data_path:
            type: uri_file
            path: azureml:code_train_data:1
        outputs:
          model_output:
        sampling_algorithm: random
        search_space:
          batch_size:
            type: choice
            values: [1, 5, 10, 15]
          learning_rate:
            type: loguniform
            min_value: -6.90775527898 # ln(0.001)
            max_value: -2.30258509299 # ln(0.1)
        trial: ./train.yaml
        objective:
          goal: maximize
          primary_metric: eval_bleu_score # the metric name as mlflow logs it in my other models
        limits:
          max_total_trials: 5
          max_concurrent_trials: 3
          timeout: 3600 # 1 hour
          trial_timeout: 720 # 12 mins
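
    For comparison only (this is not part of the fix itself), the same structure can be sketched with the Python SDK v2 (azure-ai-ml): the trial is defined as its own command object and only then turned into a sweep, which mirrors keeping the trial in a separate train.yaml. This is a minimal sketch; the compute, environment, and data-asset names are carried over from the YAML in this post and may need adjusting for your workspace.

    from azure.ai.ml import MLClient, Input, Output, command
    from azure.ai.ml.sweep import Choice, LogUniform
    from azure.identity import DefaultAzureCredential

    ml_client = MLClient(
        DefaultAzureCredential(),
        subscription_id="<subscription-id>",   # placeholder
        resource_group_name="grp_name",
        workspace_name="ws-name",
    )

    # The trial is its own command job, analogous to train.yaml.
    trial = command(
        code="../src",
        command=(
            "python train.py "
            "--data_path ${{inputs.data_path}} "
            "--output_path ${{outputs.model_output}} "
            "--batch_size ${{inputs.batch_size}} "
            "--learning_rate ${{inputs.learning_rate}}"
        ),
        environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
        inputs={
            "data_path": Input(type="uri_file", path="azureml:code_train_data:1"),
            "batch_size": 5,        # defaults, overridden by the sweep below
            "learning_rate": 0.01,
        },
        outputs={"model_output": Output(type="mlflow_model")},
        compute="roma2",
    )

    # Re-call the command with search-space distributions, then turn it into a sweep.
    trial_for_sweep = trial(
        batch_size=Choice(values=[1, 5, 10, 15]),
        learning_rate=LogUniform(min_value=-6.90775527898, max_value=-2.30258509299),
    )
    sweep_job = trial_for_sweep.sweep(
        sampling_algorithm="random",
        primary_metric="eval_bleu_score",
        goal="Maximize",
    )
    sweep_job.set_limits(
        max_total_trials=5, max_concurrent_trials=3, timeout=3600, trial_timeout=720
    )

    returned_job = ml_client.jobs.create_or_update(sweep_job)
    print(returned_job.name)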
    

    There was another problem. In the train.yaml file, my source directory sits parallel to the YAML files, so I needed to reference it as ../src:

    $schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
    type: command
    
    name: train_model
    display_name: train_model
    version: 1
    
    inputs:
      data_path:
        type: uri_file
      batch_size:
        type: integer
      learning_rate:
        type: number
    
    outputs:
      model_output:
        type: mlflow_model
    
    code: ../src
    
    environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
    
    command: >-
      python train.py
        --data_path ${{inputs.data_path}}
        --output_path ${{outputs.model_output}}
        --batch_size ${{inputs.batch_size}}
        --learning_rate ${{inputs.learning_rate}}
    

    Note that I simplified the arguments just to focus on getting this to work. Additionally, I fixed parser.add_arguments to parser.add_argument, as pointed out in one of the comments.
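
    For completeness, the corrected argument parsing looks like this (argparse's method is add_argument, singular; calling add_arguments raises AttributeError). Everything else in train.py is unchanged:

    import argparse

    def parse_args():
        parser = argparse.ArgumentParser()
        # add_argument, not add_arguments
        parser.add_argument("--data_path")
        parser.add_argument("--output_path")
        parser.add_argument("--batch_size", type=int)
        parser.add_argument("--learning_rate", type=float)
        return parser.parse_args()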