python · pytorch · loss · huggingface-trainer

Logging training and validation loss per epoch with the Hugging Face Trainer in PyTorch to assess the bias-variance tradeoff


I'm fine-tuning a transformer model for text classification in PyTorch using the Hugging Face Trainer. I would like to log both the training and the validation loss for each epoch of training, so that I can tell when the model starts to overfit the training data, i.e. the point at which the training loss keeps decreasing while the validation loss plateaus or increases (the bias-variance tradeoff).

Here are my TrainingArguments for the Hugging Face Trainer:

import os
from transformers import TrainingArguments

training_arguments = TrainingArguments(
    output_dir=os.path.join(MODEL_DIR, f'{TODAYS_DATE}_multicls_cls'),
    run_name=f'{TODAYS_DATE}_multicls_cls',
    overwrite_output_dir=True,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    num_train_epochs=7.0,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    optim='adamw_torch',
    learning_rate=LEARNING_RATE,
)

Evaluation runs every epoch, as desired, but the training loss is still logged every 500 steps, which is the Trainer's default (logging_steps=500). You can see this in trainer.state.log_history after training:

{'eval_loss': 6.346338748931885, 'eval_f1': 0.2146690518783542, 'eval_runtime': 1.2777, 'eval_samples_per_second': 31.306, 'eval_steps_per_second': 31.306, 'epoch': 1.0, 'step': 160}
{'eval_loss': 5.505970001220703, 'eval_f1': 0.23817863397548159, 'eval_runtime': 1.5768, 'eval_samples_per_second': 25.367, 'eval_steps_per_second': 25.367, 'epoch': 2.0, 'step': 320}
{'eval_loss': 5.21959114074707, 'eval_f1': 0.2233676975945017, 'eval_runtime': 1.3016, 'eval_samples_per_second': 30.732, 'eval_steps_per_second': 30.732, 'epoch': 3.0, 'step': 480}
{'loss': 6.1108, 'learning_rate': 2.767857142857143e-05, 'epoch': 3.12, 'step': 500}
{'eval_loss': 5.014569282531738, 'eval_f1': 0.24625623960066553, 'eval_runtime': 1.3961, 'eval_samples_per_second': 28.652, 'eval_steps_per_second': 28.652, 'epoch': 4.0, 'step': 640}
{'eval_loss': 5.090881824493408, 'eval_f1': 0.2212643678160919, 'eval_runtime': 1.2708, 'eval_samples_per_second': 31.477, 'eval_steps_per_second': 31.477, 'epoch': 5.0, 'step': 800}
{'eval_loss': 4.950728416442871, 'eval_f1': 0.23750000000000002, 'eval_runtime': 1.298, 'eval_samples_per_second': 30.816, 'eval_steps_per_second': 30.816, 'epoch': 6.0, 'step': 960}
{'loss': 3.8989, 'learning_rate': 5.357142857142857e-06, 'epoch': 6.25, 'step': 1000}
{'eval_loss': 4.940125465393066, 'eval_f1': 0.24444444444444444, 'eval_runtime': 1.4609, 'eval_samples_per_second': 27.38, 'eval_steps_per_second': 27.38, 'epoch': 7.0, 'step': 1120}
{'train_runtime': 80.7323, 'train_samples_per_second': 13.873, 'train_steps_per_second': 13.873, 'total_flos': 73700199874560.0, 'train_loss': 4.81386468069894, 'epoch': 7.0, 'step': 1120}

How can I set the training arguments to log the training loss every epoch, just like the validation loss? I can't find an equivalent of evaluation_strategy='epoch' for the training loss in TrainingArguments.


Solution

  • To log the training loss every epoch, set logging_strategy='epoch' in TrainingArguments. The default is logging_strategy='steps' together with logging_steps=500, which is why the training loss only appeared at steps 500 and 1000 above. The updated arguments are sketched below.
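
    A minimal sketch of the updated arguments (same values as before; MODEL_DIR, TODAYS_DATE and LEARNING_RATE are my own constants):

        training_arguments = TrainingArguments(
            output_dir=os.path.join(MODEL_DIR, f'{TODAYS_DATE}_multicls_cls'),
            run_name=f'{TODAYS_DATE}_multicls_cls',
            overwrite_output_dir=True,
            evaluation_strategy='epoch',    # evaluate on the validation set every epoch
            logging_strategy='epoch',       # log the training loss every epoch as well
            save_strategy='epoch',
            num_train_epochs=7.0,
            per_device_train_batch_size=1,
            per_device_eval_batch_size=1,
            optim='adamw_torch',
            learning_rate=LEARNING_RATE,
        )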

    Now I get:

    {'loss': 7.1773, 'learning_rate': 4.2857142857142856e-05, 'epoch': 1.0, 'step': 160}
    {'eval_loss': 6.232218265533447, 'eval_f1': 0.20766773162939295, 'eval_runtime': 1.2916, 'eval_samples_per_second': 30.97, 'eval_steps_per_second': 30.97, 'epoch': 1.0, 'step': 160}
    {'loss': 6.3841, 'learning_rate': 3.571428571428572e-05, 'epoch': 2.0, 'step': 320}
    {'eval_loss': 5.86290979385376, 'eval_f1': 0.2006269592476489, 'eval_runtime': 1.3634, 'eval_samples_per_second': 29.339, 'eval_steps_per_second': 29.339, 'epoch': 2.0, 'step': 320}
    {'loss': 5.5212, 'learning_rate': 2.857142857142857e-05, 'epoch': 3.0, 'step': 480}
    {'eval_loss': 5.343527793884277, 'eval_f1': 0.24319419237749546, 'eval_runtime': 1.29, 'eval_samples_per_second': 31.008, 'eval_steps_per_second': 31.008, 'epoch': 3.0, 'step': 480}
    {'loss': 4.7184, 'learning_rate': 2.1428571428571428e-05, 'epoch': 4.0, 'step': 640}
    {'eval_loss': 5.131855487823486, 'eval_f1': 0.23588039867109634, 'eval_runtime': 1.3336, 'eval_samples_per_second': 29.993, 'eval_steps_per_second': 29.993, 'epoch': 4.0, 'step': 640}
    {'loss': 4.0205, 'learning_rate': 1.4285714285714285e-05, 'epoch': 5.0, 'step': 800}
    {'eval_loss': 4.972315788269043, 'eval_f1': 0.22551928783382788, 'eval_runtime': 1.2714, 'eval_samples_per_second': 31.462, 'eval_steps_per_second': 31.462, 'epoch': 5.0, 'step': 800}
    {'loss': 3.5411, 'learning_rate': 7.142857142857143e-06, 'epoch': 6.0, 'step': 960}
    {'eval_loss': 4.964015960693359, 'eval_f1': 0.23100303951367776, 'eval_runtime': 1.2783, 'eval_samples_per_second': 31.292, 'eval_steps_per_second': 31.292, 'epoch': 6.0, 'step': 960}
    {'loss': 3.2564, 'learning_rate': 0.0, 'epoch': 7.0, 'step': 1120}
    {'eval_loss': 4.895078182220459, 'eval_f1': 0.22585438335809802, 'eval_runtime': 1.3362, 'eval_samples_per_second': 29.935, 'eval_steps_per_second': 29.935, 'epoch': 7.0, 'step': 1120}
    {'train_runtime': 81.2849, 'train_samples_per_second': 13.779, 'train_steps_per_second': 13.779, 'total_flos': 73700199874560.0, 'train_loss': 4.945595060076032, 'epoch': 7.0, 'step': 1120}
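
  • To line the two curves up and spot where the validation loss stops improving, the per-epoch losses can be pulled back out of trainer.state.log_history. A minimal sketch (assuming training has finished and matplotlib is installed; the key names match the log entries above):

        # Training-loss entries carry the key 'loss'; evaluation entries carry 'eval_loss'.
        # The final summary entry only has 'train_loss', so both filters skip it.
        history = trainer.state.log_history
        train_points = [(e['epoch'], e['loss']) for e in history if 'loss' in e]
        eval_points = [(e['epoch'], e['eval_loss']) for e in history if 'eval_loss' in e]

        import matplotlib.pyplot as plt

        plt.plot(*zip(*train_points), label='training loss')   # keeps decreasing
        plt.plot(*zip(*eval_points), label='validation loss')  # watch for a plateau or uptick
        plt.xlabel('epoch')
        plt.ylabel('loss')
        plt.legend()
        plt.show()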