I noticed that when resuming training of a model from a checkpoint, changing properties such as save_steps and per_device_train_batch_size has no effect. Is there something syntactically wrong here, or does the configuration saved with the checkpoint technically override everything?
import transformers
from datetime import datetime
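# model, tokenizer, the tokenized datasets, output_dir, run_name, and
# upload_checkpoint_callback are defined earlier (not shown here)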
tokenizer.pad_token = tokenizer.eos_token
learning_rate = 5e-5
warmup_steps = 100
gradient_accumulation_steps = 2
trainer = transformers.Trainer(
    model=model,
    callbacks=[upload_checkpoint_callback],
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=transformers.TrainingArguments(
        output_dir=output_dir,
        warmup_steps=warmup_steps,
        per_device_train_batch_size=8,
        gradient_checkpointing=True,
        gradient_accumulation_steps=gradient_accumulation_steps,
        max_steps=5000,
        learning_rate=learning_rate,
        logging_steps=10,
        fp16=True,
        optim="paged_adamw_8bit",
        logging_dir="/content/logs",
        save_strategy="steps",
        save_steps=10,
        evaluation_strategy="steps",
        eval_steps=10,
        load_best_model_at_end=True,
        report_to="wandb",
        run_name=f"{run_name}-{datetime.now().strftime('%Y-%m-%d-%H-%M')}",  # Name of the W&B run (optional)
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False
trainer.train(resume_from_checkpoint="/content/latest_checkpoint/")
There is nothing syntactically wrong with your code; the transformers library does not support changing training arguments when resuming from a checkpoint. Each checkpoint directory stores the original trainer state (trainer_state.json) and training arguments (training_args.bin), and resuming continues from that saved state rather than from the new TrainingArguments you pass in.
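
If you need the new values to take effect, a common workaround is to drop resume_from_checkpoint and instead load the model weights from the checkpoint, then start a fresh run with the new TrainingArguments. Below is a minimal sketch of that approach: the new output directory, batch size, and step intervals are placeholder values, it assumes the checkpoint contains full model weights (not only a PEFT adapter), and it reuses the tokenizer and tokenized datasets from the snippet above. The trade-off is that the optimizer/scheduler state and the global step counter are not restored, so only the weights carry over.

# Load just the weights from the checkpoint rather than resuming the Trainer state
# (assumes a full-model checkpoint; the values below are illustrative placeholders).
model = transformers.AutoModelForCausalLM.from_pretrained("/content/latest_checkpoint/")

new_args = transformers.TrainingArguments(
    output_dir="/content/new_run",          # fresh output dir for the new schedule
    per_device_train_batch_size=16,         # new batch size now applies
    save_strategy="steps",
    save_steps=100,                         # new save interval now applies
    evaluation_strategy="steps",
    eval_steps=100,
    max_steps=5000,
    fp16=True,
)

trainer = transformers.Trainer(
    model=model,
    args=new_args,
    train_dataset=tokenized_train_dataset,  # same tokenized datasets as above
    eval_dataset=tokenized_val_dataset,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# No resume_from_checkpoint here: optimizer/scheduler state and the step counter
# start from scratch, but the new arguments are honored.
trainer.train()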