Tags: devops, clearml, trains

ClearML multiple tasks in single script changes logged value names


I trained multiple models with different configurations for a custom hyperparameter search. I use pytorch_lightning and its logging (TensorBoardLogger). When I run my training script after calling Task.init(), ClearML auto-creates a Task and connects the logger output to the server.

For each training stage (train, val and test) I log the following scalars at each epoch: loss, acc and iou.

When I have multiple configurations, e.g. networkA and networkB, the first training logs its values to loss, acc and iou, but the second logs to networkB:loss, networkB:acc and networkB:iou. This makes the values incomparable.

My training loop with Task initialization looks like this:

from clearml import Task

names = ['networkA', 'networkB']
for name in names:
    task = Task.init(project_name="NetworkProject", task_name=name)
    pl_train(name)
    task.close()

The method pl_train is a wrapper for the whole training with PyTorch Lightning. No ClearML code is inside this method.
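For context, a minimal sketch of what such a wrapper could look like, assuming a tiny classification model; the real pl_train is not shown in the question, and the model, data and metric code below are hypothetical placeholders (iou is omitted for brevity):

import torch
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger
from torch.utils.data import DataLoader, TensorDataset


class LitNet(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 2)

    def forward(self, x):
        return self.layer(x)

    def _shared_step(self, batch, stage):
        # Compute and log the per-stage scalars described in the question
        x, y = batch
        logits = self(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
        acc = (logits.argmax(dim=1) == y).float().mean()
        self.log(f"{stage}_loss", loss)
        self.log(f"{stage}_acc", acc)
        return loss

    def training_step(self, batch, batch_idx):
        return self._shared_step(batch, "train")

    def validation_step(self, batch, batch_idx):
        self._shared_step(batch, "val")

    def test_step(self, batch, batch_idx):
        self._shared_step(batch, "test")

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def pl_train(name):
    # A fresh TensorBoardLogger is created for every configuration,
    # which is what the answer below points to as the trigger.
    logger = TensorBoardLogger("tb_logs", name=name)
    ds = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))
    loader = DataLoader(ds, batch_size=16)
    model = LitNet()
    trainer = pl.Trainer(max_epochs=2, logger=logger)
    trainer.fit(model, loader, loader)
    trainer.test(model, loader)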

Do you have any hint on how to properly use a loop in a single script with completely separated tasks?


Edit: The ClearML version was 0.17.4. The issue is fixed in the main branch.


Solution

  • Disclaimer: I'm part of the ClearML (formerly Trains) team.

    pytorch_lightning creates a new TensorBoard logger for each experiment. When ClearML logs the TB scalars and captures the same scalar being re-sent, it adds a prefix so that reporting the same metric does not overwrite the previous one. A good example is reporting the loss scalar in the training phase vs. the validation phase (producing "loss" and "validation:loss").

    It might be that the task.close() call does not clear the previous logs, so ClearML "thinks" this is the same experiment and therefore adds the networkB prefix to loss. As long as you close the Task after training is completed, all experiments should log with the same metric/variant (title/series). I suggest opening a GitHub issue; this should probably be considered a bug.
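A minimal sketch of one possible workaround (not an official ClearML recommendation): run each configuration in its own subprocess so that both the Task and the in-process TensorBoard state are torn down completely between runs. reuse_last_task_id=False is an existing Task.init argument that forces a brand-new task each time; pl_train is the wrapper from the question.

from multiprocessing import Process

from clearml import Task


def run_one(name):
    # Each subprocess gets its own fresh ClearML and TensorBoard state
    task = Task.init(project_name="NetworkProject",
                     task_name=name,
                     reuse_last_task_id=False)
    pl_train(name)
    task.close()


if __name__ == "__main__":
    for name in ["networkA", "networkB"]:
        p = Process(target=run_one, args=(name,))
        p.start()
        p.join()  # run one configuration at a time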