The ROUGE metrics were introduced to "automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans" [1].
When calculating any ROUGE metric, you get an AggregateScore object with three fields: low, mid, and high.
How are these aggregate values calculated?
For example, from the huggingface implementation [2]:
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
... references=references)
>>> print(list(results.keys()))
['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
>>> print(results["rouge1"])
AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))
>>> print(results["rouge1"].mid.fmeasure)
1.0
Edit: On July 7th, the huggingface implementation was simplified to return a cleaner, easier-to-understand dict: https://github.com/huggingface/evaluate/issues/148
Given a list of (summary, gold_summary) pairs, each ROUGE metric is calculated for every item in the list. In huggingface, you can opt out of the aggregation step by passing use_aggregator=False and get these per-item values returned.
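For example, continuing the session above (the output shown is from the pre-July-7th implementation, where each entry is a Score tuple; later versions return plain floats instead):
>>> results = rouge.compute(predictions=predictions,
...                         references=references,
...                         use_aggregator=False)
>>> print(results["rouge1"])
[Score(precision=1.0, recall=1.0, fmeasure=1.0), Score(precision=1.0, recall=1.0, fmeasure=1.0)]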
For the aggregation, bootstrap resampling is used [1, 2]. Bootstrap resampling is a technique for extracting confidence intervals [3, 4]. The idea is that given n samples, you draw x resamples of size n with replacement, and calculate some statistic for each resample. This yields a new distribution, called the empirical bootstrap distribution, from which confidence intervals can be extracted.
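Here is a minimal sketch of the general technique in plain NumPy (not library code; the scores array is a hypothetical list of per-example ROUGE F-measures):

import numpy as np

# Hypothetical per-example ROUGE-1 F-measures for n = 5 summaries.
scores = np.array([0.21, 0.35, 0.28, 0.40, 0.33])
n = len(scores)

rng = np.random.default_rng(0)

# Draw x resamples of size n with replacement; use the mean as the statistic.
x = 1000
resample_means = np.array(
    [rng.choice(scores, size=n, replace=True).mean() for _ in range(x)]
)

# Percentiles of the empirical bootstrap distribution give a 95% confidence interval.
low, mid, high = np.percentile(resample_means, [2.5, 50.0, 97.5])
print(low, mid, high)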
In the ROUGE implementation by google [4], they used:

- 1000 resamples by default (the n_samples parameter)
- the mean as the resample statistic
- the 2.5th, 50th and 97.5th percentiles to calculate the values for low, mid and high, respectively (can be controlled with the confidence_interval param)
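These choices can be seen directly in google's rouge_score package, which the huggingface metric wraps. A short sketch of the aggregation using its public API (the predictions/references are the toy pairs from above):

from rouge_score import rouge_scorer, scoring

predictions = ["hello there", "general kenobi"]
references = ["hello there", "general kenobi"]

scorer = rouge_scorer.RougeScorer(["rouge1"])
aggregator = scoring.BootstrapAggregator(confidence_interval=0.95, n_samples=1000)

# add_scores takes a dict mapping rouge type -> Score, one call per pair.
for ref, pred in zip(references, predictions):
    aggregator.add_scores(scorer.score(ref, pred))

# aggregate() returns a dict of AggregateScore(low, mid, high) per rouge type.
result = aggregator.aggregate()
print(result["rouge1"].mid.fmeasure)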
Note that due to the bootstrapping technique used in ROUGE aggregation, it is non-deterministic and can return different results on each run (see [5]). If you want reproducible results without opting out of the bootstrapping, you can set the seed in the load function, like so: evaluate.load('rouge', seed=42).