tensorflow keras google-colaboratory tpu google-cloud-tpu

Extremely slow when saving model on Colab TPU


Saving a model is extremely slow in the Colab TPU environment.

I first ran into this when using a checkpoint callback, which caused training to get stuck at the end of the first epoch.

Then I removed the callback and saved the model directly with model.save_weights(), but nothing changed. Watching the output file from the Colab terminal, I found that only about 100k was written in 5 minutes.
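The same measurement can be taken from inside the notebook instead of the terminal. The following is only a sketch: `timed_save` is a hypothetical helper name, and the commented usage assumes the `model` object from the fitting code in this question.

```python
import os
import time


def timed_save(save_fn, path):
    """Call save_fn(path), then return (seconds, bytes, bytes_per_second)."""
    start = time.perf_counter()
    save_fn(path)
    elapsed = time.perf_counter() - start
    size = os.path.getsize(path)  # size of the file that save_fn wrote
    rate = size / elapsed if elapsed > 0 else float("inf")
    return elapsed, size, rate


# Usage with the model from the question (not run here):
# elapsed, size, rate = timed_save(
#     lambda p: model.save_weights(p, overwrite=True), "epoch-test.h5")
# print(f"{size / 1024:.0f} KiB in {elapsed:.1f} s ({rate / 1024:.1f} KiB/s)")
```

Passing the save call as a function keeps the helper independent of Keras, so the same timing can be compared between eager and graph mode.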

TensorFlow version: 2.3

My model-fitting code is below:

from tensorflow import keras

# tpu_strategy, create_model() and get_train_ds() are assumed to be defined earlier.
with tpu_strategy.scope():  # creating the model in the TPUStrategy scope means we will train the model on the TPU
    model = create_model()

    # Save the weights at the end of every epoch, e.g. baseline_001.h5
    checkpoint = keras.callbacks.ModelCheckpoint('baseline_{epoch:03d}.h5',
                                                 save_weights_only=True,
                                                 save_freq="epoch")

    hist = model.fit(get_train_ds().repeat(),
                     steps_per_epoch=100,
                     epochs=5,
                     verbose=1,
                     callbacks=[checkpoint])

    model.save_weights("epoch-test.h5", overwrite=True)

Solution

  • The issue occurred because I had explicitly switched to graph mode by calling

    from tensorflow.python.framework.ops import disable_eager_execution
    disable_eager_execution()
    

    before

    with tpu_strategy.scope():
        model.fit(...)
    

    I still don't understand the cause, but removing disable_eager_execution() solved the issue.
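To catch this situation early, a notebook could check that eager execution is still enabled before training and saving. This is a minimal sketch assuming TensorFlow 2.x; `assert_eager_before_saving` is a hypothetical helper name, not part of TensorFlow.

```python
import tensorflow as tf


def assert_eager_before_saving():
    """Hypothetical guard: fail fast if eager execution was disabled."""
    if not tf.executing_eagerly():
        raise RuntimeError(
            "Eager execution is disabled; remove disable_eager_execution() "
            "before training and saving on the TPU."
        )


# TensorFlow 2.x runs eagerly by default, so this passes unless
# disable_eager_execution() was called earlier in the session.
assert_eager_before_saving()
eager_enabled = tf.executing_eagerly()
```

Calling the guard just before `with tpu_strategy.scope():` would make a leftover `disable_eager_execution()` show up as an immediate error rather than as a save that appears to hang.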