huggingface-transformers, huggingface-trainer

Cannot change training arguments when resuming from a checkpoint


I noticed that when resuming training of a model from a checkpoint, changing properties like save_steps and per_device_train_batch_size has no effect. I'm wondering whether there's something syntactically wrong here, or whether the configuration saved with the model checkpoint technically overrides everything?

import transformers
from datetime import datetime

# model, tokenizer, tokenized_train_dataset, tokenized_val_dataset, output_dir,
# run_name and upload_checkpoint_callback are defined earlier in the notebook.
tokenizer.pad_token = tokenizer.eos_token

learning_rate = 5e-5  
warmup_steps = 100

gradient_accumulation_steps = 2  

trainer = transformers.Trainer(
    model=model,
    callbacks=[upload_checkpoint_callback],
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=transformers.TrainingArguments(
        output_dir=output_dir,
        warmup_steps=warmup_steps,
        per_device_train_batch_size=8,
        gradient_checkpointing=True,
        gradient_accumulation_steps=gradient_accumulation_steps,
        max_steps=5000,
        learning_rate=learning_rate,
        logging_steps=10,
        fp16=True,
        optim="paged_adamw_8bit",
        logging_dir="/content/logs",       
        save_strategy="steps",      
        save_steps=10,              
        evaluation_strategy="steps", 
        eval_steps=10,               
        load_best_model_at_end=True,
        report_to="wandb",           
        run_name=f"{run_name}-{datetime.now().strftime('%Y-%m-%d-%H-%M')}"          # Name of the W&B run (optional)
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False
trainer.train(resume_from_checkpoint="/content/latest_checkpoint/")

Solution

  • The transformers library does not provide a way to change training arguments when resuming from a checkpoint; the run continues under the state saved with the checkpoint rather than the newly passed values.
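
A common workaround, sketched below as an assumption rather than an official API: skip resume_from_checkpoint, reload only the model weights from the checkpoint directory, and start a fresh Trainer with the new arguments. This discards the optimizer, learning-rate scheduler and step counter saved in the checkpoint, and the loading step will differ if the original run used PEFT adapters or quantized weights. The checkpoint path and argument values are illustrative.

import transformers

# Reload just the model weights saved by the Trainer in the checkpoint folder.
checkpoint_dir = "/content/latest_checkpoint/"  # assumed checkpoint path
model = transformers.AutoModelForCausalLM.from_pretrained(checkpoint_dir)

# Define new arguments; changed values (batch size, save_steps, ...) now take effect
# because this is a brand-new run, not a resumed one.
new_args = transformers.TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=4,
    save_strategy="steps",
    save_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
    max_steps=5000,
    learning_rate=5e-5,
    fp16=True,
)

trainer = transformers.Trainer(
    model=model,
    args=new_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()  # fresh run; no resume_from_checkpoint, so optimizer/scheduler state is lost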