machine-learning deep-learning

How can I automatically judge whether the training of a deep learning model has converged?


When training a deep learning model, I have to watch the loss curve and the performance curve to judge whether training has converged.

This costs me a lot of time, and judging the point of convergence by eye is sometimes inaccurate.

Is there an algorithm or a package that can automatically detect whether the training of a deep learning model has converged?


Solution

  • At the risk of disappointing you, I believe there is no such universal algorithm. In my experience, it depends on what you want to achieve, which metrics matter to you, and how long you are willing to let the training run.

    If you really need an algorithm, I would suggest this fairly simple one:

    1. Compute a validation metric M(i) after each epoch i, either on a fixed subset of your validation set or on the whole validation set. Assume that the higher M(i) is, the better. Choose an integer k based on how long one training epoch takes (k ≈ 3 should do the trick).
    2. If for some n you have M(n) > max(M(n+1), ..., M(n+k)), that is, none of the next k epochs matches or beats M(n), stop and keep the network you had at epoch n.
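
    The two steps above can be sketched in plain Python. This is only a minimal illustration, not a library API: the class name `EarlyStopper`, its `update` method, and the sample metric values are all made up for the example, and epochs are counted from 0.

```python
from typing import List, Optional


class EarlyStopper:
    """Hypothetical helper implementing the rule M(n) > max(M(n+1), ..., M(n+k))."""

    def __init__(self, k: int = 3) -> None:
        self.k = k                              # number of later epochs to compare against
        self.history: List[float] = []          # M(0), M(1), ... (higher is better)
        self.best_epoch: Optional[int] = None   # epoch n to keep once the rule fires

    def update(self, metric: float) -> bool:
        """Record the metric of the latest epoch; return True if training should stop."""
        self.history.append(metric)
        n = len(self.history) - 1 - self.k      # candidate epoch with k later epochs observed
        if n >= 0 and self.history[n] > max(self.history[n + 1:]):
            self.best_epoch = n
            return True
        return False


# Made-up validation accuracies standing in for a real training loop.
simulated_metrics = [0.60, 0.70, 0.75, 0.74, 0.73, 0.745, 0.80]

stopper = EarlyStopper(k=3)
for epoch, m in enumerate(simulated_metrics):
    # In a real loop: train one epoch, evaluate, and save a checkpoint for this epoch,
    # so that the network from `best_epoch` can be restored after stopping.
    if stopper.update(m):
        print(f"stop after epoch {epoch}, keep the network from epoch {stopper.best_epoch}")
        break
```

    Note that the simulated run stops even though a later value (0.80) would have been higher, which shows why the rule is far from perfect. A similar idea, usually called early stopping, ships with some frameworks, for example the `EarlyStopping` callback in Keras with its `patience` argument, although its exact comparison differs from the rule sketched here.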

    It's far from perfect, but should be enough for simple tasks.

    [Edit] If you're not using it yet, I recommend TensorBoard for visualizing how your metrics evolve during training. Once set up, it is a huge time saver.