machine-learning deep-learning

How to automatically judge whether the training process of the deep learning model is converged?

When training a deep learning model, I have to look at the loss curve and performance curve to judge whether the training process of the deep learning model is converged.

This has cost me a lot of time. Sometimes, the time of convergence judged by the naked eye is not accurate.

Is there an algorithm or a package that can automatically judge whether the training process of the deep learning model is converged?

Solution

To the risk of disappointing you, I believe there is no such universal algorithm. In my experience, it depends on what you want to achieve, which metrics are important to you and how much time you are willing to let the training go on for.

I have already seen validation losses dramatically go up (a sign of overfitting) while other metrics (mIoU in this case) were still improving on the validation set. In these cases, you need to know what your target is.
It is possible (although it is very rare) that your loss goes up for a substantial amount of time before going down again and reach better levels than before. There is no way to anticipate this.
Finally, and this is arguably a common case if you have tons of training data, your validation loss may continually go down, but do so slower and slower. In this case, the best strategy if you had an infinite amount of time would be to let it keep the training going indefinitely. In practice, this is impossible, and you would need to find the right balance between performance and training time.

If you really need an algorithm, I would suggest this quite simple one :

Compute a validation metric M(i) after each ith epoch on a fixed subset of your validation set or the whole validation set. Let's suppose that the higher M(i)is, the better. Fix k an integer depending on the duration of one training epoch (k~3 should do the trick)
If for some n you have M(n) > max(M(n+1), ..., M(n+k)), stop and keep the network you had at epoch n.

It's far from perfect, but should be enough for simple tasks.

[Edit] If you're not using it yet, I invite you to use TensorBoard to visualize the evolution of your metrics throughout the training. Once set up, it is a huge gain of time.