python, tensorflow, machine-learning, training-data, resuming-training

Is it possible to resume training from a checkpoint model in TensorFlow?


I am doing auto segmentation, and I was training a model over the weekend when the power went out. I had trained the model for 50+ hours and was saving it every 5 epochs using the line:

model_checkpoint = ModelCheckpoint('test_{epoch:04}.h5', monitor=observe_var, mode='auto', save_weights_only=False, save_best_only=False, period=5)
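
For context, I pass this callback into model.fit roughly like this (a simplified sketch; model, train_x, and train_y come from my own setup, and the batch size and epoch count here are just placeholders):

model.fit(train_x, train_y,
          batch_size=2,                  # placeholder value
          epochs=100,                    # placeholder value
          callbacks=[model_checkpoint])  # writes test_0005.h5, test_0010.h5, ... every 5 epochs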

I'm loading the saved model using the line:

model = load_model('test_{epoch:04}.h5', custom_objects={'dice_coef_loss': dice_coef_loss, 'dice_coef': dice_coef})
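
What I ultimately want is to pick up training where the last checkpoint left off, something like the sketch below (the filename and epoch numbers are placeholders for whichever checkpoint was written last, the other names are from my setup above, and part of my question is whether this is the right way to do it):

# e.g. resume from the checkpoint written at the end of epoch 45
model = load_model('test_0045.h5', custom_objects={'dice_coef_loss': dice_coef_loss, 'dice_coef': dice_coef})

model.fit(train_x, train_y,
          initial_epoch=45,              # epoch of the loaded checkpoint (placeholder)
          epochs=100,                    # original target number of epochs (placeholder)
          callbacks=[model_checkpoint])  # keep saving every 5 epochs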

I have already loaded all of my data, which is split into train_x for the scans and train_y for the labels. When I run the line:

loss, dice_coef = model.evaluate(train_x,  train_y, verbose=1)

I get the error:

ResourceExhaustedError:  OOM when allocating tensor with shape[32,8,128,128,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
 [[node model/conv3d_1/Conv3D (defined at <ipython-input-1-4a66b6c9f26b>:275) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_distributed_function_3673]

Function call stack:
distributed_function

Solution

  • This basically means you are running out of GPU memory, so you need to run the evaluation in smaller batches. The default batch size is 32; try a smaller one, as shown in the sketch below this list.

    model.evaluate(train_x, train_y, batch_size=<batch size>)
    

    From the Keras documentation:

    batch_size: Integer or None. Number of samples per gradient update. If unspecified, batch_size will default to 32.
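
For example, a call like the following may be enough to avoid the OOM (the value 4 is just a starting point, not something from your original setup; halve it again if it still does not fit on your GPU):

    # evaluate in smaller batches so each forward pass fits in GPU memory
    loss, dice_coef = model.evaluate(train_x, train_y, batch_size=4, verbose=1)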