I had to stop training in the middle, which set the Trains task status to Aborted. Later I continued it from the last checkpoint, but the status remained Aborted. Furthermore, automatic training metrics stopped appearing in the dashboard (though custom metrics still do).
Can I reset the status back to Running and make Trains log the training stats again?
Edit: When continuing training, I retrieved the task using Task.get_task() and not Task.init(). Maybe that's why training stats are not updated anymore?
Edit 2: I also tried Task.init(reuse_last_task_id=original_task_id_string), but it just creates a new task and doesn't reuse the given task ID.
Disclaimer: I'm a member of the Allegro Trains team.
When continuing training, I retrieved the task using Task.get_task() and not Task.init(). Maybe that's why training stats are not updated anymore?
Yes, that is the only way to continue the exact same Task.
You can also mark it as started with task.mark_started(). That said, the automatic logging will not kick in, since Task.get_task() is usually used for accessing previously executed tasks, not for continuing them. (If you think the continue use case is important, please feel free to open a GitHub issue; I can definitely see the value there.)
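For example, a minimal sketch of re-opening the aborted task this way (the task ID is a placeholder); only explicit reporting will show up, automatic framework logging is not re-attached:
from trains import Task
# '<original_task_id>' is a placeholder for the aborted task's ID
task = Task.get_task(task_id='<original_task_id>')
task.mark_started()
# explicit reporting still works, e.g.:
task.get_logger().report_scalar(title='loss', series='train', value=0.5, iteration=100)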
You can also do something a bit different and just create a new Task that continues from the last iteration the previous run ended at. Notice that if you load the weights file (PyTorch/TF/Keras/joblib), it will automatically be connected with the model that was created in the previous run (assuming the model was stored in the same location, or if you have the model on https/S3/GS/Azure and you are using trains.StorageManager.get_local_copy()):
import torch
from trains import Task
# '<previous_task_id>' is a placeholder for the ID of the aborted run
previous_run = Task.get_task(task_id='<previous_task_id>')
task = Task.init('examples', 'continue training')
task.set_initial_iteration(previous_run.get_last_iteration())
# loading the checkpoint auto-connects it with the model from the previous run
model = torch.load('/tmp/my_previous_weights')
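If the previous checkpoint is stored remotely, a minimal sketch using trains.StorageManager.get_local_copy() (the URL below is a placeholder):
from trains import StorageManager
import torch
# download (and cache) a local copy of the remote checkpoint; the URL is a placeholder
local_weights = StorageManager.get_local_copy(remote_url='s3://my-bucket/checkpoints/my_previous_weights.pt')
model = torch.load(local_weights)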
BTW:
I also tried Task.init(reuse_last_task_id=original_task_id_string), but it just creates a new task, and doesn't reuse the given task ID.
This is a great idea for an interface to continue a previous run; feel free to add it as a GitHub issue.