metrics, evaluation, seq2seq, bleu

Average of BLEU scores on two subsets of data is not the same as overall score


For evaluating a sequence generation model, I'm using BLEU-1 through BLEU-4. I split the test set into two subsets and calculated the scores on each subset separately, as well as on the whole test set. Surprisingly, the result I get on the whole test set is not the weighted average of the results I get on each subset. For example, consider the BLEU-4 scores I get on a set and two subsets of it:

  • set1 (866 elements): 0.0001529267908

  • set2 (1010 elements): 0.1625387989

  • set1 ∪ set2 (1876 elements): 0.3063472152

How should I aggregate the results from the two subsets to get the overall result?

Note: I know that all the elements in set1 are shorter than 4 tokens; that's why BLEU-4 is almost zero there.
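
For reference, here is a minimal sketch of the kind of corpus-level computation I mean, using NLTK's `corpus_bleu` on hypothetical toy data (my real sets are much larger):

```python
from nltk.translate.bleu_score import corpus_bleu

# Hypothetical toy data: each hypothesis is a token list; each entry of
# the references list is a list of acceptable reference token lists.
hyps1 = [["the", "cat", "sat", "on", "mat"]]
refs1 = [[["the", "cat", "sat", "on", "the", "mat"]]]

hyps2 = [["he", "reads", "a", "book", "in", "the", "park"]]
refs2 = [[["he", "reads", "a", "book", "at", "the", "park"]]]

bleu1 = corpus_bleu(refs1, hyps1)                     # BLEU-4 on subset 1
bleu2 = corpus_bleu(refs2, hyps2)                     # BLEU-4 on subset 2
bleu_all = corpus_bleu(refs1 + refs2, hyps1 + hyps2)  # BLEU-4 on the union

n1, n2 = len(hyps1), len(hyps2)
print((n1 * bleu1 + n2 * bleu2) / (n1 + n2))  # weighted average of subset scores
print(bleu_all)                               # union score: a different number
```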


Solution

  • The BLEU score is non-linear by definition, as you can see in the original paper by Papineni et al.:

    $$\text{BLEU} = \text{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right), \qquad \text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$

    where $p_n$ are the modified n-gram precisions, $w_n$ their weights, $c$ the total candidate length, and $r$ the total reference length.

    It is a product of two terms: the brevity penalty (BP) and a weighted geometric mean of the modified n-gram precisions. Neither the brevity penalty nor the geometric mean is a linear operation with respect to averaging. Moreover, both are computed from counts pooled over the whole corpus (total n-gram matches and total lengths), not from per-sentence scores, so the scores of two subsets cannot simply be recombined into the score of their union.
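
    To make the brevity-penalty part concrete, here is a tiny numeric illustration with hypothetical lengths: the BP of the pooled corpus is computed from the summed lengths and is not the average of the per-subset penalties.

    ```python
    import math

    # Hypothetical candidate/reference token totals for each subset.
    c1, r1 = 800, 1000    # subset 1: hypotheses shorter than references
    c2, r2 = 1200, 1100   # subset 2: hypotheses longer than references

    def bp(c, r):
        """Brevity penalty as defined in Papineni et al."""
        return 1.0 if c > r else math.exp(1 - r / c)

    avg_bp = (bp(c1, r1) + bp(c2, r2)) / 2   # averaging subset penalties
    pooled_bp = bp(c1 + c2, r1 + r2)         # penalty of the pooled corpus

    print(avg_bp, pooled_bp)   # ~0.889 vs ~0.951: not the same
    ```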

    Regarding what you should report: since the two test sets look fundamentally different, the best option is to report two separate numbers.

    I don't know what your task is, but given that the desired outputs are very short, BLEU might not be the best choice for evaluation. You might consider an edit-based metric (e.g., TER), or even plain accuracy might do a good job. If you do need a single overall BLEU number anyway, see the sketch below.
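
    The correct way to get one overall number is to recompute BLEU on the union, or equivalently to pool BLEU's sufficient statistics (clipped n-gram matches, candidate n-gram totals, and lengths) across the subsets and compute the score once from the pooled counts. A self-contained sketch of this, assuming a single reference per hypothesis and uniform BLEU-4 weights (all function names here are my own):

    ```python
    import math
    from collections import Counter

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def bleu_stats(hyps, refs, max_n=4):
        """Sufficient statistics of corpus BLEU for one subset (single references)."""
        match = [0] * max_n          # clipped n-gram matches
        total = [0] * max_n          # candidate n-gram counts
        hyp_len = ref_len = 0
        for hyp, ref in zip(hyps, refs):
            hyp_len += len(hyp)
            ref_len += len(ref)
            for n in range(1, max_n + 1):
                h, r = ngrams(hyp, n), ngrams(ref, n)
                match[n - 1] += sum(min(c, r[g]) for g, c in h.items())
                total[n - 1] += max(len(hyp) - n + 1, 0)
        return match, total, hyp_len, ref_len

    def merge(s1, s2):
        """Pool statistics of two subsets; addition is the correct aggregation."""
        m1, t1, h1, r1 = s1
        m2, t2, h2, r2 = s2
        return ([a + b for a, b in zip(m1, m2)],
                [a + b for a, b in zip(t1, t2)], h1 + h2, r1 + r2)

    def bleu_from_stats(stats):
        """Corpus BLEU from pooled stats: BP times geometric mean of precisions."""
        match, total, hyp_len, ref_len = stats
        if min(match) == 0:          # unsmoothed BLEU is zero if any precision is zero
            return 0.0
        log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / len(match)
        bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
        return bp * math.exp(log_prec)
    ```

    With lists of token lists `h1, r1, h2, r2`, `bleu_from_stats(merge(bleu_stats(h1, r1), bleu_stats(h2, r2)))` is exactly equal to `bleu_from_stats(bleu_stats(h1 + h2, r1 + r2))`, because every statistic is additive over sentence pairs.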