machine-learning, nlp, nltk, machine-translation, bleu

What is the difference between mteval-v13a.pl and NLTK BLEU?


There is an implementation of the BLEU score in Python's NLTK, nltk.translate.bleu_score.corpus_bleu

But I am not sure if it is the same as the mteval-v13a.pl script.

What is the difference between them?


Solution

  • TL;DR

    Use https://github.com/mjpost/sacrebleu when evaluating Machine Translation systems.
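
    As a minimal sketch of what that looks like (the example sentences are made up for illustration; sacrebleu applies its own tokenization, so plain detokenized text is passed in):

    import sacrebleu

    hypotheses = ["the cat sat on the mat"]           # one system output per line
    references = [["the cat is sitting on the mat"]]  # a list of reference streams

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(bleu.score)  # corpus-level BLEU on a 0-100 scale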

    In Short

    No, the BLEU in NLTK isn't exactly the same as mteval-v13a.pl.

    But it can get really close; see https://github.com/nltk/nltk/issues/1330#issuecomment-256237324

    nltk.translate.corpus_bleu corresponds to mteval-v13a.pl up to the 4th order of n-grams, with some floating-point discrepancies.

    The details of the comparison and the dataset used can be downloaded from https://github.com/nltk/nltk_data/blob/gh-pages/packages/models/wmt15_eval.zip or:

    import nltk
    nltk.download('wmt15_eval')
    
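    For comparison, a hedged sketch of calling nltk.translate.bleu_score.corpus_bleu (the sentences are made up; note that, unlike mteval-v13a.pl, NLTK expects pre-tokenized input, so the tokenization you choose directly affects the score):

    from nltk.translate.bleu_score import corpus_bleu

    hypotheses = [["the", "cat", "sat", "on", "the", "mat"]]
    # one list of (possibly multiple) references per hypothesis
    list_of_references = [[["the", "cat", "is", "sitting", "on", "the", "mat"]]]

    score = corpus_bleu(list_of_references, hypotheses)  # defaults to uniform 4-gram weights
    print(score)  # BLEU as a float in [0, 1]
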

    The major differences:

    [image: table summarizing the major differences between mteval-v13a.pl and NLTK BLEU]


    In Long

    There are several differences between mteval-v13a.pl and nltk.translate.corpus_bleu (summarized in the table above).

    Beyond those differences, NLTK's BLEU implementation packs in more features, such as user-specified n-gram weights and smoothing functions; see the sketch below.
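
    As an illustrative sketch of two of those features, custom n-gram weights and the smoothing methods of Chen & Cherry (2014) (again with made-up sentences):

    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    hypotheses = [["the", "cat", "sat", "on", "the", "mat"]]
    list_of_references = [[["the", "cat", "is", "sitting", "on", "the", "mat"]]]

    # BLEU-2: weight only unigrams and bigrams
    bleu2 = corpus_bleu(list_of_references, hypotheses, weights=(0.5, 0.5))

    # smoothed BLEU-4, useful for short segments with few n-gram matches
    chencherry = SmoothingFunction()
    smoothed = corpus_bleu(list_of_references, hypotheses,
                           smoothing_function=chencherry.method1)
    print(bleu2, smoothed)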

    Lastly, to validate the features added in NLTK's version of BLEU, regression tests were added to account for them; see https://github.com/nltk/nltk/blob/develop/nltk/test/unit/translate/test_bleu.py