I have the following Dataset, which has 3 splits (train, validation and test). The data is a parallel corpus of 2 languages.
DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 109942
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 6545
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 13743
    })
})
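For completeness, the tokenized_dataset used below is produced roughly like this (a sketch of my preprocessing; the checkpoint name, the dataset variable name, and the 'en'/'fr' keys inside the translation column are placeholders for my actual setup, and text_target requires a recent transformers version):
from transformers import AutoTokenizer

# Placeholder checkpoint; my real model is different
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

def preprocess(examples):
    # Each entry in 'translation' is a dict like {'en': ..., 'fr': ...}
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["fr"] for ex in examples["translation"]]
    # text_target tokenizes the labels alongside the inputs
    return tokenizer(inputs, text_target=targets, max_length=128, truncation=True)

# dataset is the DatasetDict shown above
tokenized_dataset = dataset.map(preprocess, batched=True)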
For my Seq2SeqTrainer, I supply the dataset as follows:
trainer = Seq2SeqTrainer(
    model = model,
    args = training_args,
    train_dataset = tokenized_dataset['train'],
    eval_dataset = tokenized_dataset['validation'],
    tokenizer = tokenizer,
    data_collator = data_collator,
    compute_metrics = compute_metrics,
)
Is it correct to put the validation split in eval_dataset? The documentation says:
The dataset to use for evaluation. If it is a Dataset, columns not accepted by the model.forward() method are automatically removed. If it is a dictionary, it will evaluate on each dataset prepending the dictionary key to the metric name.
Or should I put the test split in eval_dataset instead? Either way, is it true that one of the splits is not used?
I am going to focus on the code side here. For a deeper theoretical explanation of why we need (or should have) training, validation and test sets, see What is the difference between test set and validation set?.
For training, using the validation set is correct, exactly as you are already doing:
trainer = Seq2SeqTrainer(
    model = model,
    args = training_args,
    train_dataset = tokenized_dataset['train'],
    eval_dataset = tokenized_dataset['validation'],
    tokenizer = tokenizer,
    data_collator = data_collator,
    compute_metrics = compute_metrics,
)
After training, you can use .predict() or .evaluate() with your test set, so no split goes unused.
If you want only the metrics, and not the outputs, use .evaluate():
metrics = trainer.evaluate(tokenized_dataset['test'])
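If you also want to record those test metrics in the usual Trainer output format, the built-in helpers work here as well (optional; "test" is just the prefix used in the logs and the saved JSON file):
trainer.log_metrics("test", metrics)
trainer.save_metrics("test", metrics)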
If you want the outputs as well as the metrics (or maybe just the outputs), use .predict():
preds = trainer.predict(tokenized_dataset['test'])
print(preds.predictions)
print(preds.metrics)
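Note that for a translation model, preds.predictions holds generated token IDs rather than text (assuming predict_with_generate=True in your Seq2SeqTrainingArguments; otherwise it contains raw logits). A rough sketch of turning the IDs back into strings:
# Decode the generated token IDs into text
decoded_preds = tokenizer.batch_decode(preds.predictions, skip_special_tokens=True)
print(decoded_preds[:5])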