machine-learning, nlp, nltk, machine-translation, bleu

What is the difference between mteval-v13a.pl and NLTK BLEU?


There is an implementation of the BLEU score in Python's NLTK, nltk.translate.bleu_score.corpus_bleu

But I am not sure if it is the same as the mteval-v13a.pl script.

What is the difference between them?


Solution

  • TL;DR

    Use https://github.com/mjpost/sacrebleu when evaluating Machine Translation systems.
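
    As a minimal sketch of what that looks like (the example sentences are made up for illustration; sacrebleu applies its own tokenization, so plain detokenized text is passed in):

    import sacrebleu

    hypotheses = ["the cat sat on the mat"]           # one system output per line
    references = [["the cat is sitting on the mat"]]  # a list of reference streams

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(bleu.score)  # corpus-level BLEU on a 0-100 scale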

    In Short

    No, the BLEU in NLTK isn't exactly the same as mteval-v13a.pl.

    But it can get really close; see https://github.com/nltk/nltk/issues/1330#issuecomment-256237324

    nltk.translate.corpus_bleu corresponds to mteval-v13a.pl up to the 4th order of n-grams, with some floating-point discrepancies.

    The details of the comparison and the dataset used can be downloaded from https://github.com/nltk/nltk_data/blob/gh-pages/packages/models/wmt15_eval.zip or:

    import nltk
    nltk.download('wmt15_eval')
    
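    For comparison, a hedged sketch of calling nltk.translate.bleu_score.corpus_bleu (the sentences are made up; note that, unlike mteval-v13a.pl, NLTK expects pre-tokenized input, so the tokenization you choose directly affects the score):

    from nltk.translate.bleu_score import corpus_bleu

    hypotheses = [["the", "cat", "sat", "on", "the", "mat"]]
    # one list of (possibly multiple) references per hypothesis
    list_of_references = [[["the", "cat", "is", "sitting", "on", "the", "mat"]]]

    score = corpus_bleu(list_of_references, hypotheses)  # defaults to uniform 4-gram weights
    print(score)  # BLEU as a float in [0, 1]
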

    The major differences:

    [image: table summarizing the major differences between mteval-v13a.pl and NLTK BLEU]


    In Long

    There are several differences between mteval-v13a.pl and nltk.translate.corpus_bleu (summarized in the table above).

    Beyond those differences, NLTK's BLEU implementation packs in more features, such as user-specified n-gram weights and smoothing functions; see the sketch below.
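
    As an illustrative sketch of two of those features, custom n-gram weights and the smoothing methods of Chen & Cherry (2014) (again with made-up sentences):

    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    hypotheses = [["the", "cat", "sat", "on", "the", "mat"]]
    list_of_references = [[["the", "cat", "is", "sitting", "on", "the", "mat"]]]

    # BLEU-2: weight only unigrams and bigrams
    bleu2 = corpus_bleu(list_of_references, hypotheses, weights=(0.5, 0.5))

    # smoothed BLEU-4, useful for short segments with few n-gram matches
    chencherry = SmoothingFunction()
    smoothed = corpus_bleu(list_of_references, hypotheses,
                           smoothing_function=chencherry.method1)
    print(bleu2, smoothed)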

    Lastly, to validate the features added in NLTK's version of BLEU, regression tests were added to account for them; see https://github.com/nltk/nltk/blob/develop/nltk/test/unit/translate/test_bleu.py