python tensorflow checkpointing

How can I use tf.train.Checkpoint to recover a previously trained net?


I'm trying to understand how to recover a saved/checkpointed net using tensorflow.train.Checkpoint.restore.

I'm using code that's strongly based on Google's Colab tutorial for creating a pix2pix GAN. Below, I've excerpted the key portion, which just attempts to instantiate a new net, then to fill it with weights from a previous net that was saved and checkpointed.

I'm assigning a unique(ish) id number to a particular instantiation of a net by summing all the weights of the net. I compare these id numbers both when the net is created and after I've attempted to recover the checkpointed net.

import tensorflow as tf

# Pix2Pix and parse_opt are defined elsewhere in my code (not shown here)
def main(opt):

    # Initialize pix2pix GAN using arguments input from command line
    p2p = Pix2Pix(vars(opt))
    print(opt)

    # print sum of initial weights for net
    print("Init Model Weights:", 
           sum([x.numpy().sum() for x in p2p.generator.weights]))

    # Create or read from model checkpoints
    checkpoint = tf.train.Checkpoint(generator_optimizer=p2p.generator_optimizer,
                                     discriminator_optimizer=p2p.discriminator_optimizer,
                                     generator=p2p.generator,
                                     discriminator=p2p.discriminator)
    
    # print sum of weights from checkpoint, to ensure it has access 
    # to relevant regions of p2p
    print("Checkpoint Weights:", 
           sum([x.numpy().sum() for x in checkpoint.generator.weights]))

    # Recover Checkpointed net
    checkpoint.restore(tf.train.latest_checkpoint(opt.weights)).expect_partial()

    # print sum of weights for p2p & checkpoint after attempting to restore saved net 
    print("Restore Model Weights:", 
           sum([x.numpy().sum() for x in p2p.generator.weights]))
    print("Restored Checkpoint Weights:", 
           sum([x.numpy().sum() for x in checkpoint.generator.weights]))
    print("Done.")

if __name__ == '__main__':
    opt = parse_opt()
    main(opt)

The output I got when I ran this code was as follows:

Namespace(channels='1', data='data', img_size=256, output='output', weights='weights/ckpt-40.data-00000-of-00001')
## These are the input arguments. The images have only 1 channel (they're grayscale).
## The directory with data is ./data, and the images are 256x256.
## The output directory is ./output.
## The checkpointed net is stored in ./weights/ckpt-40.data-00000-of-00001


## Sums of nets' weights
Init Model Weights: 11047.206374436617
Checkpoint Weights: 11047.206374436617
Restore Model Weights: 11047.206374436617
Restored Checkpoint Weights: 11047.206374436617

Done.

There is no change in the sum of the net's weights before and after recovering the checkpointed version, although p2p and checkpoint do seem to have access to the same locations in memory.

Why am I not recovering the saved net?


Solution

  • The problem arose because tf.train.latest_checkpoint needs the directory in which the checkpointed net is stored, not the specific file (or what I took to be the specific file, ./weights/ckpt-40.data-00000-of-00001).

    When it is not given a valid checkpoint directory, it returns None, and checkpoint.restore(None) then proceeds to the next line of code without updating the net or throwing an error. The fix was to pass the directory containing the relevant checkpoint files (./weights) rather than the single file I believed to be relevant.
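To make this kind of failure loud instead of silent, something along these lines may help. It is only a minimal sketch: the helper names `resolve_checkpoint_dir` and `restore_latest` are hypothetical (they are not part of TensorFlow or of the code in the question), and it assumes the checkpoint files sit directly in the directory that contains the path passed in.

```python
import os


def resolve_checkpoint_dir(path):
    """Return the directory to hand to tf.train.latest_checkpoint.

    Hypothetical helper: if `path` is already a directory it is returned
    unchanged; otherwise it is treated as a file inside the checkpoint
    directory and its parent directory is returned.
    """
    if os.path.isdir(path):
        return path
    return os.path.dirname(path) or "."


def restore_latest(checkpoint, weights_path):
    """Restore the newest checkpoint found near `weights_path`, or raise."""
    # Imported lazily so the path helper above works without TensorFlow.
    import tensorflow as tf

    ckpt_dir = resolve_checkpoint_dir(weights_path)
    latest = tf.train.latest_checkpoint(ckpt_dir)  # None if nothing is found
    if latest is None:
        # Fail loudly instead of letting restore(None) do nothing.
        raise FileNotFoundError(f"No checkpoint found in {ckpt_dir!r}")
    # Return the load status so the caller can run its own checks.
    return checkpoint.restore(latest)
```

In `main`, the restore line could then become `status = restore_latest(checkpoint, opt.weights)`, followed by `status.expect_partial()` as in the question or, for a stricter check, `status.assert_existing_objects_matched()`.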