Tags: python, ray, ray-tune

How to print metrics per epoch for the best model from ray tune?


I have this code:

from ray import tune
from ray import air
from ray.air.config import RunConfig
from ray.tune import CLIReporter
from ray.tune.search.hyperopt import HyperOptSearch
from hyperopt import fmin, hp, tpe, Trials, space_eval, STATUS_OK
import os


config_dict = {
            "c_hidden": tune.choice([64]),
            "dp_rate_linear": tune.choice([0.1]), #could change to quniform and give a 3-point tuple range
            "num_layers":tune.choice([3]),
            "dp_rate":tune.choice([0.3])
              }
search_alg = HyperOptSearch()

hyperopt_search = HyperOptSearch(
    metric="val_loss", mode="min")
    #points_to_evaluate=current_best_params)


#tuner = tune.Tuner(tune.with_resources(train_fn, {"gpu": 1}), run_config= RunConfig(local_dir='/home/runs/',sync_config=tune.SyncConfig,checkpoint_config=air.CheckpointConfig()), tune_config=tune.TuneConfig(num_samples=1,search_alg=hyperopt_search),param_space=config_dict)
reporter = CLIReporter(parameter_columns=['c_hidden'],metric_columns=["val_loss", "val_acc", "training_iteration"])
tuner = tune.Tuner(tune.with_resources(train_fn, {"gpu": 1}), tune_config=tune.TuneConfig(num_samples=1,search_alg=hyperopt_search),param_space=config_dict,run_config= RunConfig(local_dir='/home/runs/'))
results = tuner.fit()
best_result = results.get_best_result(metric="val_loss", mode="min") #add .config to see best

best_checkpoint = best_result.checkpoint
path = os.path.join(str(best_checkpoint.to_directory()), "ray_ckpt3")
model = GraphLevelGNN.load_from_checkpoint(path)
print(path)

It runs Ray Tune on a network, performs a hyperparameter optimization, and saves the best network. What I can't work out is how to save the metrics I've asked for in the `reporter` variable to a file: i.e., for the best run, how do I save the validation accuracy and loss over epochs so I can plot them?


Solution

  • The following works for me:

    results = tuner.fit()
    metrics_frames = [result.metrics_dataframe for result in results]
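To persist those per-trial frames for later plotting, note that each `metrics_dataframe` is a pandas DataFrame and can be written straight to CSV. A minimal sketch (the helper name and file-naming scheme are my own, not part of Ray's API):

```python
import pandas as pd

def save_trial_metrics(results, out_prefix="trial"):
    """Write each trial's metrics dataframe to <out_prefix>_<i>_metrics.csv."""
    paths = []
    for i, result in enumerate(results):
        path = f"{out_prefix}_{i}_metrics.csv"
        # One row per reported iteration, one column per metric.
        result.metrics_dataframe.to_csv(path, index=False)
        paths.append(path)
    return paths
```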
    

    when adding this as the return value of the training function:

    return {
        'score': score,
        'validation_loss': validation_losses,
        'training_loss': training_losses,
    }
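For context, here is a hypothetical training function with that return shape, accumulating one loss value per epoch (the dummy losses stand in for real training and evaluation):

```python
def train(config, num_epochs=3):
    training_losses, validation_losses = [], []
    for epoch in range(num_epochs):
        # Placeholder values; a real trial would train and evaluate here.
        training_losses.append(1.0 / (epoch + 1))
        validation_losses.append(1.2 / (epoch + 1))
    score = -validation_losses[-1]  # e.g. maximise the negative final val loss
    return {
        "score": score,
        "validation_loss": validation_losses,
        "training_loss": training_losses,
    }
```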
    

To get them into a usable format, though, I had to do some casting: the list of loss values per epoch in each resulting data frame somehow gets converted to a string at some point between returning it from my training function and accessing it. I first stripped the `[` and `]` characters with `[1:-1]`, then split the remaining string on `,`, and finally cast each piece to a float:

    training_losses = [[float(l) for l in frame['training_loss'][0][1:-1].split(',')] for frame in metrics_frames]
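An alternative to the manual slicing and splitting, assuming the same stringified-list format, is `ast.literal_eval` from the standard library, which parses the string back into a Python list directly:

```python
import ast

def parse_loss_column(value):
    # Handles both the stringified form, e.g. "[0.9, 0.5]", and a real list.
    if isinstance(value, str):
        value = ast.literal_eval(value)
    return [float(v) for v in value]
```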
    

    Also, in case anybody wants it, here's how I set up my tuner:

    from ray.tune import Tuner, TuneConfig
    from ray.tune.schedulers import ASHAScheduler

    tuner = Tuner(
        self.train,
        tune_config=TuneConfig(
            num_samples=NUM_TRIALS,
            max_concurrent_trials=MAX_CONCURRENT_TRIALS,
            scheduler=ASHAScheduler(metric="score", mode="max"),
        ),
        param_space=param_search_space,
    )
    results = tuner.fit()
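From there, the best trial's per-iteration metrics can be picked out by comparing the final reported metric. A sketch, assuming each result exposes `metrics` (the final reported values) and `metrics_dataframe` as above:

```python
def best_metrics_frame(results, metric="score"):
    """Return the metrics dataframe of the trial with the highest final metric."""
    best = max(results, key=lambda r: r.metrics[metric])
    return best.metrics_dataframe
```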