I'm learning MLflow in Databricks using the tutorial https://docs.databricks.com/_extras/notebooks/source/mlflow/mlflow-end-to-end-example-uc.html. The tutorial includes using nested MLflow runs for hyperparameter optimization of XGBoost. A parent run is created via
with mlflow.start_run(run_name='xgboost_models'):
    best_params = fmin(
        fn=train_model,
        space=search_space,
        algo=tpe.suggest,
        max_evals=96,
        trials=spark_trials,
    )
which invokes the model training process defined by
def train_model(params):
    mlflow.xgboost.autolog()
    with mlflow.start_run(nested=True):
        train = xgb.DMatrix(data=X_train, label=y_train)
        validation = xgb.DMatrix(data=X_val, label=y_val)
        # Additional training code here
The successful result is that on the Databricks default Experiments page (i.e., the MLflow GUI pointing to the default location), I see a run called xgboost_models that can be expanded to show a list of child runs where the actual ML training was performed. The parent-child grouping requested by mlflow.start_run(nested=True) came out nicely.
Trouble comes when I decide that my runs should be logged to an Experiments location that I choose myself, instead of the default location in Databricks. First, I create the new location:
EXPERIMENT_NAME = '/Users/dxxxx@redacted.com/MLflow_experiments/dxxxx_minimal_MLflow'
# Look up the experiment by name (get_experiment_by_name returns None if it does not exist)
experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)
if experiment is None:
    # If the experiment does not exist, create it (create_experiment returns the new ID)
    experiment_id = mlflow.create_experiment(EXPERIMENT_NAME)
else:
    # If the experiment exists, get its ID
    experiment_id = experiment.experiment_id
This goes well, in the sense that if I execute a single unnested ML run via with mlflow.start_run(experiment_id=experiment_id, run_name='untuned_random_forest'), the new log for run untuned_random_forest shows up on the dxxxx_minimal_MLflow Experiments page.
It really gets weird when I try this with the hyperopt nested runs. If I modify the outer call to read
with mlflow.start_run(experiment_id=experiment_id, run_name='xgboost_models_2'):
    best_params = fmin(
        fn=train_model,
        space=search_space,
        algo=tpe.suggest,
        max_evals=96,
        trials=spark_trials,
    )
and change nothing else, my new parent run xgboost_models_2 shows up on the dxxxx_minimal_MLflow experiment page with no children. And all the child runs show up back on the default experiment page with no parent, which is pretty hideous!
Checking the details, it may be important to note that the child runs do have a Parent ID tag, and its value seems to be set correctly to the ID of the xgboost_models_2 parent run. This leads me to suspect that the nested argument to mlflow.start_run(nested=True) is doing its job well, and somehow the GUI is simply failing to interpret the parent-child relationship correctly.
Questions:
Footnote: I've tried to fix this by passing additional parameters into the child invocations of mlflow.start_run(), such as experiment_id and parent_run_id, but that seems to make no difference. And that seems very reasonable, because as I noted above, the child runs seem to be correctly tagged with the Parent Run ID in the first place.
So, a solution.
By logging some extra parameters from the child runs, I determined that my MLflow environment (by whose fault, I can't say) creates the child runs with a different experiment_id value than that of the parent run, seemingly in total defiance of nested=True and in utter disregard for any parameters like experiment_id or parent_run_id that I might pass into the child invocation of mlflow.start_run().
However, we can set the experiment globally at the point where we initially created/obtained the desired experiment_id in the first place, i.e., the block that sets and uses EXPERIMENT_NAME. Just add the following line to the end of that block:
mlflow.set_experiment(experiment_id=experiment_id)
(But still, the failure of nested=True doesn't seem like a very nice thing.)