python, machine-learning, dataset, huggingface-transformers

Use of Training, Validation and Test set in HuggingFace Seq2SeqTrainer


I have the following Dataset, which has 3 splits (train, validation and test). The data is a parallel corpus of two languages.

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 109942
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 6545
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 13743
    })
})

For my Seq2SeqTrainer, I supply the dataset as follows:

trainer = Seq2SeqTrainer(
    model = model,
    args = training_args,
    train_dataset = tokenized_dataset['train'],
    eval_dataset = tokenized_dataset['validation'],
    tokenizer = tokenizer,
    data_collator = data_collator,
    compute_metrics = compute_metrics,
)

Is it correct to put the validation split in eval_dataset? The documentation says:

The dataset to use for evaluation. If it is a Dataset, columns not accepted by the model.forward() method are automatically removed. If it is a dictionary, it will evaluate on each dataset prepending the dictionary key to the metric name.

Or should I put the test split in eval_dataset? Either way, is it true that one of the splits is not used?


Solution

  • I am going to focus on the code side here. For a deeper theoretical explanation of why we need (or should have) training, validation and test set, see What is the difference between test set and validation set?.

    For training, passing the validation split as eval_dataset is correct, the way you already do it:

    trainer = Seq2SeqTrainer(
        model = model,
        args = training_args,
        train_dataset = tokenized_dataset['train'],
        eval_dataset = tokenized_dataset['validation'],
        tokenizer = tokenizer,
        data_collator = data_collator,
        compute_metrics = compute_metrics,
    )
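
    As an aside, the dictionary behaviour mentioned in the documentation you quoted means eval_dataset can also be a dict of datasets (in recent transformers versions), with each key prepended to the metric names. A minimal sketch of that form, purely for illustration:

    trainer = Seq2SeqTrainer(
        model = model,
        args = training_args,
        train_dataset = tokenized_dataset['train'],
        # each key becomes part of the metric name, e.g. 'eval_validation_loss', 'eval_test_loss'
        eval_dataset = {
            'validation': tokenized_dataset['validation'],
            'test': tokenized_dataset['test'],
        },
        tokenizer = tokenizer,
        data_collator = data_collator,
        compute_metrics = compute_metrics,
    )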
    

    After training, you can use .predict() or .evaluate() with your test set.

    If you want only the metrics, and not the outputs, you can use .evaluate():

    metrics = trainer.evaluate(tokenized_dataset['test'])
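
    By default the returned keys are prefixed with eval_ (eval_loss, and so on). If you want the test metrics labelled explicitly, .evaluate() also accepts a metric_key_prefix argument; the exact metric names depend on your compute_metrics function:

    metrics = trainer.evaluate(tokenized_dataset['test'], metric_key_prefix = 'test')
    # keys now look like 'test_loss', 'test_bleu', ... instead of 'eval_loss', 'eval_bleu', ...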
    

    If you want the outputs as well as the metrics (or maybe just the outputs), you can use .predict():

    preds = trainer.predict(tokenized_dataset['test'])
    print(preds.predictions)
    print(preds.metrics)
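
    For a translation model, preds.predictions are token IDs rather than text (assuming you set predict_with_generate=True in your Seq2SeqTrainingArguments), so you typically decode them with the tokenizer. A minimal sketch, assuming any -100 padding should be replaced before decoding:

    import numpy as np

    # generated token IDs; replace any -100 padding with the pad token before decoding
    pred_ids = np.where(preds.predictions != -100, preds.predictions, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    print(decoded_preds[:5])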