When training a model with the Hugging Face Trainer object, e.g. from https://www.kaggle.com/code/alvations/neural-plasticity-bert2bert-on-wmt14:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
import os
os.environ["WANDB_DISABLED"] = "true"
batch_size = 2
# set training arguments - these params are not really tuned, feel free to change
training_args = Seq2SeqTrainingArguments(
    output_dir="./",
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    logging_steps=2,  # set to 1000 for full training
    save_steps=16,    # set to 500 for full training
    eval_steps=4,     # set to 8000 for full training
    warmup_steps=1,   # set to 2000 for full training
    max_steps=16,     # delete for full training
    # overwrite_output_dir=True,
    save_total_limit=1,
    # fp16=True,
)
# instantiate trainer
trainer = Seq2SeqTrainer(
    model=multibert,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
)
trainer.train()
When training finishes, it outputs:
TrainOutput(global_step=16, training_loss=10.065429925918579, metrics={'train_runtime': 541.4209, 'train_samples_per_second': 0.059, 'train_steps_per_second': 0.03, 'total_flos': 19637939109888.0, 'train_loss': 10.065429925918579, 'epoch': 0.03})
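With save_steps=16 and save_total_limit=1, that run should leave a single checkpoint-16 folder under output_dir="./", which is what a later resume would pick up. A quick sanity check (a hypothetical snippet, not part of the original notebook):
import os
# List the checkpoint directories the Trainer has written into output_dir.
print([d for d in os.listdir("./") if d.startswith("checkpoint-")])
# e.g. ['checkpoint-16']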
If we want to continue training with more steps, e.g. max_steps=16 (from the previous trainer.train() run) followed by another max_steps=160, do we do something like this?
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
import os
os.environ["WANDB_DISABLED"] = "true"
batch_size = 2
# set training arguments - these params are not really tuned, feel free to change
training_args = Seq2SeqTrainingArguments(
    output_dir="./",
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    logging_steps=2,  # set to 1000 for full training
    save_steps=16,    # set to 500 for full training
    eval_steps=4,     # set to 8000 for full training
    warmup_steps=1,   # set to 2000 for full training
    max_steps=16,     # delete for full training
    # overwrite_output_dir=True,
    save_total_limit=1,
    # fp16=True,
)
# instantiate trainer
trainer = Seq2SeqTrainer(
    model=multibert,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
)
# First 16 steps.
trainer.train()
# set training arguments - these params are not really tuned, feel free to change
training_args_2 = Seq2SeqTrainingArguments(
    output_dir="./",
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    logging_steps=2,  # set to 1000 for full training
    save_steps=16,    # set to 500 for full training
    eval_steps=4,     # set to 8000 for full training
    warmup_steps=1,   # set to 2000 for full training
    max_steps=160,    # delete for full training
    # overwrite_output_dir=True,
    save_total_limit=1,
    # fp16=True,
)
# instantiate trainer
trainer = Seq2SeqTrainer(
    model=multibert,
    tokenizer=tokenizer,
    args=training_args_2,
    train_dataset=train_data,
    eval_dataset=val_data,
)
# Continue training for 160 steps
trainer.train()
If the above is not the canonical way to continue training a model, how do we continue training with the Hugging Face Trainer?
With transformers version 4.29.1, trying @maciej-skorski's answer with Seq2SeqTrainer,
trainer = Seq2SeqTrainer(
    model=multibert,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
    resume_from_checkpoint=True,
)
it throws an error:
TypeError: Seq2SeqTrainer.__init__() got an unexpected keyword argument 'resume_from_checkpoint'
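For reference, in 4.29.1 resume_from_checkpoint is an argument of trainer.train() rather than of the constructor, so a sketch that avoids the TypeError (reusing the same multibert, tokenizer, train_data and val_data objects from above) would be:
trainer = Seq2SeqTrainer(
    model=multibert,
    tokenizer=tokenizer,
    args=training_args_2,   # the new config with max_steps=160
    train_dataset=train_data,
    eval_dataset=val_data,
)
# Pick up the latest checkpoint saved under output_dir and continue from there.
trainer.train(resume_from_checkpoint=True)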
If your use case is about further training an already partially trained model, it can be solved the same way as fine-tuning: you pass the current model state along with a new parameter config to the Trainer
object in the PyTorch API. I would say this is the canonical way :-)
The code you proposed matches the general fine-tuning pattern from the Hugging Face docs:
trainer = Trainer(
    model,
    args=training_args,
    train_dataset=...,
    eval_dataset=...,
    tokenizer=tokenizer,
)
You may also resume training from existing checkpoints:
trainer.train(resume_from_checkpoint=True)
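If you want to resume from a specific checkpoint instead of the latest one in output_dir, resume_from_checkpoint also accepts a path; the "./checkpoint-16" directory below is just an assumed example matching the save_steps=16 run above:
# Resume from an explicit checkpoint directory (path is an assumed example).
trainer.train(resume_from_checkpoint="./checkpoint-16")
Resuming this way also restores the optimizer and learning-rate scheduler state stored with the checkpoint, rather than re-initialising them.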