[SOLVED] Training checkpoints for deep learning with Keras

I'm using Google Colab, and saving the weights on my drive.

Training:

from tensorflow.keras.callbacks import ModelCheckpoint

def train(model, network_input, network_output):
    """Train the neural network, saving a checkpoint whenever the loss improves."""
    # {epoch} and {loss} are filled in by Keras when each checkpoint is written
    filepath = "/content/gdrive/MyDrive/weights-improvement-{epoch:02d}-{loss:.4f}-bigger.hdf5"
    checkpoint = ModelCheckpoint(
        filepath,
        monitor='loss',
        verbose=0,
        save_best_only=True,
        mode='min'
    )
    callbacks_list = [checkpoint]

    model.fit(network_input, network_output, epochs=200, batch_size=128, callbacks=callbacks_list)

After training for some time, I have the weights saved in my Drive: [screenshot: weights in my drive]

Then I resume training without modifying my functions, and the output cell looks like this: [screenshot: output cell]

How can I tell whether training resumed from the best weights so far, i.e. "weights-improvement-06-4.1851-bigger.hdf5", or just restarted from the beginning? If it is training from the saved weights, shouldn't it show that in some way? For example, I would expect the epoch counter to continue from where it left off (Epoch 4/200) instead of starting again at Epoch 1/200.

Solution

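If you are still using the same instantiated model object (i.e. you haven't instantiated a new one), calling fit again continues from the weights the model currently holds in memory; it won't start over. Note that these are the weights from the last epoch that ran, not necessarily the best checkpoint on disk, and that the progress display starts again at Epoch 1/200 unless you pass initial_epoch to fit.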
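To make the epoch counter continue as well, Keras' fit accepts an initial_epoch argument. Below is a minimal sketch of how that could be wrapped; continue_training and completed_epochs are hypothetical names that are not part of the original code, and model is assumed to be a compiled Keras model.

```python
def continue_training(model, network_input, network_output,
                      completed_epochs, total_epochs=200, **fit_kwargs):
    """Call model.fit so that the epoch counter continues after `completed_epochs`.

    In Keras, `epochs` is the index of the final epoch, not the number of
    additional epochs, so this runs total_epochs - completed_epochs epochs.
    """
    if completed_epochs >= total_epochs:
        raise ValueError("nothing left to train: completed_epochs >= total_epochs")
    return model.fit(
        network_input,
        network_output,
        initial_epoch=completed_epochs,  # display starts at completed_epochs + 1
        epochs=total_epochs,
        **fit_kwargs,                    # e.g. batch_size=128, callbacks=[...]
    )
```

For instance, continue_training(model, network_input, network_output, completed_epochs=6, batch_size=128) should begin its progress output at Epoch 7/200.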

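However, if you want to instantiate a new model using the same config and start from a previously saved set of weights (checkpoint), you can use TensorFlow's tf.train.latest_checkpoint to find the most recent checkpoint in your directory and then load it into the model with load_weights. Note that latest_checkpoint only looks for checkpoints in TensorFlow's own checkpoint format (it reads the "checkpoint" index file in that directory) and returns None if it finds none, so for .hdf5 files like the ones in the question you would pass the file path to load_weights directly.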

import os
import tensorflow as tf

# returns None if the directory holds no TensorFlow-format checkpoint
last_ckpt = tf.train.latest_checkpoint(os.path.join('my', 'checkpoint', 'directory'))
# `model` is the newly instantiated model using the same config
model.load_weights(last_ckpt)
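Since the checkpoint file names in the question embed the epoch and the loss, another way to choose a file to resume from is to parse those names. The following is only a sketch and not part of the original answer: find_best_checkpoint is a hypothetical helper, and the regular expression assumes exactly the weights-improvement-{epoch:02d}-{loss:.4f}-bigger.hdf5 naming pattern from the question.

```python
import os
import re

# Matches names such as "weights-improvement-06-4.1851-bigger.hdf5"
_CKPT_RE = re.compile(r"^weights-improvement-(\d+)-(-?\d+\.\d+)-bigger\.hdf5$")

def find_best_checkpoint(directory):
    """Return (path, epoch, loss) for the lowest-loss checkpoint, or None."""
    best = None
    for name in os.listdir(directory):
        match = _CKPT_RE.match(name)
        if match is None:
            continue  # skip files that do not follow the naming pattern
        epoch, loss = int(match.group(1)), float(match.group(2))
        # prefer the lower loss; on a tie, prefer the later epoch
        if best is None or (loss, -epoch) < (best[2], -best[1]):
            best = (os.path.join(directory, name), epoch, loss)
    return best
```

The returned path could then be passed to model.load_weights, and the returned epoch to fit as initial_epoch, since Keras writes the 1-based epoch number into the file name. Whether load_weights accepts these .hdf5 files may depend on the Keras version in use.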