I want each day's training run to resume from where the previous day's run left off, but every time it restarts from a fixed earlier checkpoint rather than from the latest one, which makes the total training time grow with each run. Changing the value of the "continue_from_global_step" parameter to 1 made no difference. Code snippet related to loading checkpoints:
if args.do_train:
    # If output files already exist, assume we are continuing training from the latest checkpoint (unless overwrite_output_dir is set)
    continue_from_global_step = 0  # If set to 0, start training from the beginning
    if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
        checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/*/' + WEIGHTS_NAME, recursive=True)))
        if len(checkpoints) > 0:
            checkpoint = checkpoints[-1]
            logger.info("Resuming training from the latest checkpoint: %s", checkpoint)
            continue_from_global_step = int(checkpoint.split('-')[-1])
            model = model_class.from_pretrained(checkpoint)
            model.to(args.device)

    train_dataset, features = load_and_cache_examples(args, model, tokenizer, processor, evaluate=False)
    global_step, tr_loss = train(args, train_dataset, features, model, tokenizer, processor, continue_from_global_step)
    logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
The problem was solved by manually passing the absolute path of the last checkpoint; the latest checkpoint was being detected incorrectly.
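For reference, the likely root cause is that sorted() orders the checkpoint paths lexicographically, so a directory such as checkpoint-1000 sorts before checkpoint-999 and checkpoints[-1] is not necessarily the most recent checkpoint. Below is a minimal sketch of selecting the latest checkpoint by its numeric global step instead; the "checkpoint-<step>" directory naming and WEIGHTS_NAME are assumed to match the snippet above, and find_latest_checkpoint is a hypothetical helper name.

import glob
import os
import re

def find_latest_checkpoint(output_dir, weights_name):
    # Collect every directory under output_dir that contains a weights file.
    checkpoint_dirs = [os.path.dirname(p)
                       for p in glob.glob(os.path.join(output_dir, '*', weights_name))]
    # Keep only directories named like "checkpoint-<step>" and compare them
    # by the numeric global step, not by the path string itself.
    numbered = []
    for d in checkpoint_dirs:
        match = re.search(r'checkpoint-(\d+)$', d)
        if match:
            numbered.append((int(match.group(1)), d))
    if not numbered:
        return None, 0
    step, latest_dir = max(numbered)
    return latest_dir, step

# Usage in place of the sorted(...)[-1] lookup above:
# checkpoint, continue_from_global_step = find_latest_checkpoint(args.output_dir, WEIGHTS_NAME)
# if checkpoint is not None:
#     model = model_class.from_pretrained(checkpoint)

With this kind of numeric comparison the latest checkpoint is found regardless of how many digits the global step has, so the manual workaround of hard-coding the absolute path should no longer be needed.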