For evaluating a sequence generation model, I'm using BLEU1:BLEU4. I separated the test set to two sets and calculated the scores on each set separately, as well as, on the whole test set. Surprisingly, the results I get from the whole test set is not the weighted average of the results I get from each set. For example, consider the BLEU4 scores I get on a set and two subsets of it:
set1, 866 elements: 0.0001529267908
set2, 1010 elements: 0.1625387989
<set1,set2>, 1876 elements: 0.3063472152
How should I aggregate the results on two subsets to get the overall result?
Note: I know that all the elements in set1 are shorter than 4 tokens that's why BLEU4 is almost zero there.
BLEU score is by definition non-linear. As you can see in the original paper by Papineni et al.:
It is a product of two terms: brevity penalty (BP) and a harmonic mean of n-gram precisions. Both the brevity penalty and the harmonic mean are not linear operations with respect to averaging.
Regarding what you should report: since the two tests set look fundamentally different, the best option is to report two separate numbers.
I don't know what your task is, but given that the desired outputs are very short, BLEU might not be the best choice for evaluation. You might consider something edit-based (e.g., TER) or even plain accuracy might do a good job.