I am trying to submit a command job in Azure ML:
ml_client.create_or_update(training_job)
but I get the error below:
MlException:
1) At least one required parameter is missing
Details:
(x) Input path can't be empty for jobs.
Resolutions:
1) Ensure all parameters required by the Job schema are specified.
If using the CLI, you can also check the full log in debug mode for more details by adding --debug to the end of your command
I've specified all the parameters for the job, so I'm not sure why I am getting this error.
Code:
training_job = command(
    name='credit_default_train1',
    display_name='Credit Default Job',
    description='Credit default training job',
    environment=env,
    code=training_folder,
    inputs={
        'train_data': Input(type='uri_folder'),
        'test_data': Input(type='uri_folder'),
        'n_estimators': 100,
        'learning_rate': 0.001,
    },
    command='''python train.py \
        --train_data ${{inputs.train_data}} --test_data ${{inputs.test_data}} \
        --n_estimators ${{inputs.n_estimators}} --learning_rate ${{inputs.learning_rate}}'''
)
The corresponding train.py code is below:
%%writefile {training_folder}/train.py
import pandas as pd
import numpy as np
import os
from xgboost import XGBClassifier
from sklearn.metrics import f1_score
import argparse
import mlflow

def main():
    # Main part of the training script
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_data', type=str, help='Training data')
    parser.add_argument('--test_data', type=str, help='Validation data')
    parser.add_argument('--n_estimators', type=int, help='Number of trees in the XGBoost model')
    parser.add_argument('--learning_rate', type=float, default=0.1, help='Learning rate for XGBoost')
    args = parser.parse_args()

    mlflow.autolog()

    train = pd.read_csv(os.path.join(args.train_data, os.listdir(args.train_data)[0]))
    test = pd.read_csv(os.path.join(args.test_data, os.listdir(args.test_data)[0]))

    cols_list = ['person_age', 'person_income', 'person_emp_length', 'loan_amnt', 'loan_int_rate', 'loan_percent_income', 'cb_person_cred_hist_length',
                 'person_home_ownership_MORTGAGE', 'person_home_ownership_OTHER', 'person_home_ownership_OWN', 'person_home_ownership_RENT',
                 'loan_intent_DEBTCONSOLIDATION', 'loan_intent_EDUCATION', 'loan_intent_HOMEIMPROVEMENT', 'loan_intent_MEDICAL', 'loan_intent_PERSONAL',
                 'loan_intent_VENTURE', 'loan_grade_A', 'loan_grade_B', 'loan_grade_C', 'loan_grade_D', 'loan_grade_E', 'loan_grade_F', 'loan_grade_G',
                 'cb_person_default_on_file_N', 'cb_person_default_on_file_Y']

    X_train = train.loc[:, cols_list].values
    y_train = train['loan_status']
    X_test = test.loc[:, cols_list].values
    y_test = test['loan_status']

    xgb = XGBClassifier(n_estimators=args.n_estimators, learning_rate=args.learning_rate)
    xgb.fit(X_train, y_train)

    y_pred = xgb.predict(X_test)
    f1 = f1_score(y_test, y_pred)  # note: this is the F1 score, not accuracy
    mlflow.log_metric("f1_score", f1)

if __name__ == '__main__':
    main()
The train and test data come from a previous data-preprocessing step. The code expects 4 inputs: train_data, test_data, n_estimators and learning_rate, and I am giving 4 inputs accordingly. Can someone please let me know where I am going wrong?
Pre-processing step:
%%writefile {data_prep_folder}/data_prep.py
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
import argparse
import mlflow
import os

def main():
    # Main function that pre-processes and splits the data
    mlflow.autolog()

    parser = argparse.ArgumentParser()
    parser.add_argument('--data', type=str, help='Input data')
    parser.add_argument('--test_size', type=float, default=0.2, help='Test data size')
    parser.add_argument('--train_data', type=str, help='Training data')
    parser.add_argument('--test_data', type=str, help='Test data')
    args = parser.parse_args()

    df = pd.read_csv(args.data)
    ...
    ...
    train.to_csv(os.path.join(args.train_data, 'data.csv'))
    test.to_csv(os.path.join(args.test_data, 'data.csv'))

if __name__ == '__main__':
    main()
The corresponding command job:
from azure.ai.ml.constants import AssetTypes

data_prep_job = command(
    name='data_prep_job10',
    description='Data Preparation for Loan Default',
    display_name='Data Prep',
    environment=env,
    code=data_prep_folder,
    inputs={
        'data': Input(type=AssetTypes.URI_FILE, path='azureml://subscriptions/...dataset.csv'),
        'test_size': .25
    },
    outputs={
        'train_data': Output(type='uri_folder', mode='rw_mount'),
        'test_data': Output(type='uri_folder', mode='rw_mount')
    },
    command='''python data_prep.py --data ${{inputs.data}} --test_size ${{inputs.test_size}} --train_data ${{outputs.train_data}} --test_data ${{outputs.test_data}}'''
)
You need to pass the outputs of data_prep_job to training_job in the pipeline configuration, like below:
data_prep_job = data_prep_component(
    data=pipeline_job_data_input,
    test_train_ratio=pipeline_job_test_train_ratio,
)

# using train_func like a python call with its own inputs
train_job = train_component(
    train_data=data_prep_job.outputs.train_data,  # note: using outputs from previous step
    test_data=data_prep_job.outputs.test_data,  # note: using outputs from previous step
    learning_rate=pipeline_job_learning_rate,  # note: using a pipeline input as parameter
    registered_model_name=pipeline_job_registered_model_name,
)
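For completeness, here is a minimal sketch of how the two command jobs from the question could be wired together and submitted as a pipeline. It assumes `env`, `ml_client`, `data_prep_job`, and `training_job` from the question are in scope; the compute target name `cpu-cluster` and the experiment name are placeholders, and the data path reuses the elided path from the question.

```python
from azure.ai.ml import dsl, Input

# Sketch only: 'cpu-cluster' is a placeholder compute target.
@dsl.pipeline(compute='cpu-cluster', description='Credit default end-to-end pipeline')
def credit_default_pipeline(pipeline_job_data_input, pipeline_job_test_size):
    # Step 1: run the data-prep command job as a pipeline step
    data_prep_step = data_prep_job(
        data=pipeline_job_data_input,
        test_size=pipeline_job_test_size,
    )
    # Step 2: bind its outputs to the training job's inputs,
    # which fills the paths that were empty in the standalone submission
    train_step = training_job(
        train_data=data_prep_step.outputs.train_data,
        test_data=data_prep_step.outputs.test_data,
    )

pipeline = credit_default_pipeline(
    pipeline_job_data_input=Input(type='uri_file', path='azureml://subscriptions/...dataset.csv'),
    pipeline_job_test_size=0.25,
)
ml_client.jobs.create_or_update(pipeline, experiment_name='credit_default')
```

Inside the `@dsl.pipeline` function, calling a command job binds its unbound inputs, which is why `train_data` and `test_data` no longer need explicit paths there.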
This is also shown in the sample you referenced, in the "Create the pipeline from components" section; refer to it.
UPDATE
Command jobs should be used as pipeline components, as shown in the GitHub repository. If a job is submitted on its own, make sure all of its dependencies are supplied, including a path for every Input.
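As a sketch of the standalone case: each `Input` must carry a `path` when the job is not fed by an upstream step. The datastore paths below are placeholders for illustration, not values from the post; `env`, `training_folder`, and `ml_client` are assumed to be defined as in the question.

```python
from azure.ai.ml import command, Input

# Sketch only: the 'azureml://datastores/...' paths are placeholders
# you would replace with the real locations of your prepared data.
standalone_job = command(
    name='credit_default_train_standalone',
    environment=env,
    code=training_folder,
    inputs={
        'train_data': Input(type='uri_folder',
                            path='azureml://datastores/workspaceblobstore/paths/credit/train/'),
        'test_data': Input(type='uri_folder',
                           path='azureml://datastores/workspaceblobstore/paths/credit/test/'),
        'n_estimators': 100,
        'learning_rate': 0.001,
    },
    command='python train.py --train_data ${{inputs.train_data}} '
            '--test_data ${{inputs.test_data}} '
            '--n_estimators ${{inputs.n_estimators}} '
            '--learning_rate ${{inputs.learning_rate}}',
)
ml_client.create_or_update(standalone_job)
```

With paths supplied this way, the "Input path can't be empty for jobs" validation error from the question should no longer be raised.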