Tags: azure, azure-functions, azure-machine-learning-service

Getting "(x) Input path can't be empty for jobs." error while submitting a command job in Azure ML


I am trying to submit a command job in Azure ML:

ml_client.create_or_update(training_job)

But I am getting the error below:

MlException: 


1) At least one required parameter is missing

Details: 

(x) Input path can't be empty for jobs.

Resolutions: 
1) Ensure all parameters required by the Job schema are specified.
If using the CLI, you can also check the full log in debug mode for more details by adding --debug to the end of your command

I've specified all the parameters for the job, so I am not sure why I am getting this error.

Code:

training_job = command(name='credit_default_train1',
                        display_name='Credit Default Job',
                        description='Credit default training job',
                        environment=env,
                        code=training_folder,
                        inputs={
                            'train_data' : Input(type='uri_folder'),
                            'test_data' : Input(type='uri_folder'),
                            'n_estimators' : 100,
                            'learning_rate' : 0.001,
                                },
                        command='''python train.py \
                                    --train_data ${{inputs.train_data}} --test_data ${{inputs.test_data}} \
                                    --n_estimators ${{inputs.n_estimators}} --learning_rate ${{inputs.learning_rate}}'''
                        )

The corresponding train.py code is below:

%%writefile {training_folder}/train.py
import pandas as pd
import numpy as np
import os
from xgboost import XGBClassifier
from sklearn.metrics import f1_score
import argparse
import mlflow

def main():
    # Main part of the training script
    parser = argparse.ArgumentParser()

    parser.add_argument('--train_data', type=str, help='Training data')
    parser.add_argument('--test_data', type=str, help='Validation data')
    parser.add_argument('--n_estimators', type=int, help='Number of trees in the XGboost model')
    parser.add_argument('--learning_rate', type=float, default=0.1, help='Learning rate for the XGBoost')
            
    args = parser.parse_args()
    
    mlflow.autolog()
                        
    train = pd.read_csv(os.path.join(args.train_data, os.listdir(args.train_data)[0]))
    test = pd.read_csv(os.path.join(args.test_data, os.listdir(args.test_data)[0]))
    
    cols_list = ['person_age', 'person_income', 'person_emp_length', 'loan_amnt', 'loan_int_rate', 'loan_percent_income', 'cb_person_cred_hist_length',
                 'person_home_ownership_MORTGAGE', 'person_home_ownership_OTHER', 'person_home_ownership_OWN', 'person_home_ownership_RENT',
                 'loan_intent_DEBTCONSOLIDATION', 'loan_intent_EDUCATION', 'loan_intent_HOMEIMPROVEMENT', 'loan_intent_MEDICAL', 'loan_intent_PERSONAL',
                 'loan_intent_VENTURE', 'loan_grade_A', 'loan_grade_B', 'loan_grade_C', 'loan_grade_D', 'loan_grade_E', 'loan_grade_F', 'loan_grade_G',
                 'cb_person_default_on_file_N', 'cb_person_default_on_file_Y']

    X_train = train.loc[ : , cols_list].values
    y_train = train['loan_status']

    X_test = test.loc[ : , cols_list].values
    y_test = test['loan_status']

    xgb = XGBClassifier(n_estimators = args.n_estimators, learning_rate = args.learning_rate)

    xgb.fit(X_train, y_train)

    y_pred = xgb.predict(X_test)
    accuracy = f1_score(y_test, y_pred)
    mlflow.log_metric("Accuracy", accuracy)


if __name__ == '__main__':
    main()

The train and test data come from the previous data preprocessing step.

The script expects four inputs, train_data, test_data, n_estimators, and learning_rate, and I am passing all four accordingly. Can someone please let me know where I am going wrong?

Pre-process step:

%%writefile {data_prep_folder}/data_prep.py
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
import argparse
import mlflow
import os

def main():
    # Main function that pre-processes and splits the data
    mlflow.autolog()
    parser = argparse.ArgumentParser()
    parser.add_argument('--data', type=str, help='Input data')
    parser.add_argument('--test_size', type = float, help = 'Test data size', default = 0.2)
    parser.add_argument('--train_data', type=str, help='Training data')
    parser.add_argument('--test_data', type=str, help='Test data')
    args = parser.parse_args()

    df = pd.read_csv(args.data)
    ...
    ...        
    train.to_csv(os.path.join(args.train_data, 'data.csv'))
    test.to_csv(os.path.join(args.test_data, 'data.csv'))


if __name__ == '__main__':
    main()

The corresponding command job:

from azure.ai.ml.constants import AssetTypes
data_prep_job = command(name='data_prep_job10',
                        description='Data Preparation for Loan Default',
                        display_name='Data Prep',
                        environment=env,
                        code=data_prep_folder,
                        inputs={
                            'data' : Input(type=AssetTypes.URI_FILE,path='azureml://subscriptions/...dataset.csv'),
                            'test_size' : .25
                                },
                        outputs={
                            'train_data' : Output(type = 'uri_folder', mode = 'rw_mount'),
                            'test_data' : Output(type = 'uri_folder', mode = 'rw_mount')
                        },
                        command='''python data_prep.py --data ${{inputs.data}} --test_size ${{inputs.test_size}} --train_data ${{outputs.train_data}} --test_data ${{outputs.test_data}}'''
                        )

Solution

  • You need to pass the outputs of data_prep_job as the inputs of training_job in the pipeline configuration, like below.

    data_prep_job = data_prep_component(
        data=pipeline_job_data_input,
        test_train_ratio=pipeline_job_test_train_ratio,
    )

    # using train_component like a Python call with its own inputs
    train_job = train_component(
        train_data=data_prep_job.outputs.train_data,  # note: using outputs from the previous step
        test_data=data_prep_job.outputs.test_data,  # note: using outputs from the previous step
        learning_rate=pipeline_job_learning_rate,  # note: using a pipeline input as a parameter
        registered_model_name=pipeline_job_registered_model_name,
    )
    

    This is also covered in the sample you referenced; see the Create the pipeline from components section.

    UPDATE

    Command jobs are meant to be composed as pipeline components, as shown in the GitHub repository. If a job is submitted on its own, make sure all of its dependencies are supplied, including a concrete path for every Input.