I trained multiple models with different configurations as part of a custom hyperparameter search. I use pytorch_lightning and its logging (TensorBoardLogger). When my training script calls Task.init(), ClearML auto-creates a Task and connects the logger output to the server.
For each training stage (train, val and test) I log the following scalars at each epoch: loss, acc and iou. When I have multiple configurations, e.g. networkA and networkB, the first training logs its values to loss, acc and iou, but the second one to networkB:loss, networkB:acc and networkB:iou. This makes the values incomparable.
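For context, the logging inside the LightningModule is roughly like the following sketch (LitNet is an illustrative stand-in for the real models, and the metric values are dummy placeholders; only the self.log calls matter here):

import torch
import pytorch_lightning as pl

class LitNet(pl.LightningModule):  # illustrative stand-in for networkA / networkB
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.layer(x), y)
        # acc / iou computation omitted; they are logged the same way as loss
        self.log("loss", loss, on_step=False, on_epoch=True)  # one value per epoch
        self.log("acc", torch.tensor(0.0), on_step=False, on_epoch=True)
        self.log("iou", torch.tensor(0.0), on_step=False, on_epoch=True)
        return loss

    # validation_step and test_step log the same three scalars per epoch

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())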
My training loop with Task initialization looks like this:
from clearml import Task

names = ['networkA', 'networkB']
for name in names:
    task = Task.init(project_name="NetworkProject", task_name=name)
    pl_train(name)
    task.close()
The method pl_train is a wrapper for the whole training with PyTorch Lightning. No ClearML code is inside this method.
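Roughly, pl_train looks like the sketch below (build_model and build_datamodule are hypothetical placeholders; the point is that a fresh TensorBoardLogger and Trainer are created on every call):

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger

def pl_train(name):
    # new TensorBoard logger per configuration; no ClearML calls in here
    tb_logger = TensorBoardLogger(save_dir="tb_logs", name=name)
    model = build_model(name)        # hypothetical factory returning networkA / networkB
    datamodule = build_datamodule()  # hypothetical data setup
    trainer = Trainer(max_epochs=10, logger=tb_logger)
    trainer.fit(model, datamodule=datamodule)
    trainer.test(model, datamodule=datamodule)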
Do you have any hint on how to properly run such a loop in a single script while keeping completely separated tasks?
Edit: The ClearML version was 0.17.4. The issue is fixed in the main branch.
Disclaimer: I'm part of the ClearML (formerly Trains) team.
pytorch_lightning creates a new TensorBoard logger for each experiment. When ClearML logs the TB scalars and captures the same scalar being re-sent, it adds a prefix so that reporting the same metric does not overwrite the previous one. A good example is reporting the loss scalar in the training phase vs. the validation phase (producing "loss" and "validation:loss"). It might be that the task.close() call does not clear the previous logs, so ClearML "thinks" this is the same experiment and hence adds the networkB prefix to loss. As long as you close the Task after training is completed, all experiments should log with the same metric/variant (title/series). I suggest opening a GitHub issue; this should probably be considered a bug.
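To illustrate the metric/variant (title/series) naming mentioned above, this is roughly how a scalar maps onto it when reported through the ClearML logger directly (the values and iteration below are dummy placeholders, not a recommended workaround):

from clearml import Task

task = Task.init(project_name="NetworkProject", task_name="networkA")
logger = task.get_logger()
# title = metric, series = variant; iteration here corresponds to the epoch
logger.report_scalar(title="loss", series="train", value=0.42, iteration=3)
logger.report_scalar(title="loss", series="validation", value=0.51, iteration=3)
task.close()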