python-3.x · machine-learning · nlp · training-data · checkpoint

Not loading all checkpoints when training again


I want training to resume each day from where the previous day's run left off, but every time I restart, training resumes from some earlier checkpoint rather than the latest one, so each run takes longer than it should. Setting the "continue_from_global_step" parameter to 1 made no difference. Here is the code snippet that loads the checkpoints:

Training

if args.do_train:
    # If the output directory already exists, assume we should continue training from the latest checkpoint (unless overwrite_output_dir is set)
    continue_from_global_step = 0 # If set to 0, start training from the beginning
    if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
        checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/*/' + WEIGHTS_NAME, recursive=True)))
        if len(checkpoints) > 0:
            checkpoint = checkpoints[-1]
            logger.info("Resuming training from the latest checkpoint: %s", checkpoint)
            continue_from_global_step = int(checkpoint.split('-')[-1])
            model = model_class.from_pretrained(checkpoint)
            model.to(args.device)
    
    train_dataset, features = load_and_cache_examples(args, model, tokenizer, processor, evaluate=False)
    global_step, tr_loss = train(args, train_dataset, features, model, tokenizer, processor, continue_from_global_step)
    logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

Solution

  • The problem was solved by manually passing the absolute path of the last checkpoint. The code was detecting the wrong directory as the "latest" checkpoint.