python machine-learning artificial-intelligence statistical-test

Calculating BLEU score between candidate and reference sentences in Python


I am calculating the BLEU score between two sentences that seem very similar to me, but the score I get is very low. Is that supposed to happen?

from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction

prediction = "I am ABC."
reference = "I'm ABC."

# Tokenize the sentences
prediction_tokens = prediction.split()
reference_tokens = reference.split()

# Calculate BLEU score
bleu_score = sentence_bleu([reference_tokens], prediction_tokens, smoothing_function=SmoothingFunction().method4)

# Print the BLEU score
print(f"BLEU score: {bleu_score:.4f}")

Output is 0.0725

Solution

  • Yes, for two reasons:

    1. Tokenization: "I am" and "I'm" produce different token sequences, and BLEU is computed from n-gram overlap between those tokens, so the tokenization alone has a substantial impact on the score (see the sketch after this list).
    2. Short sentences: BLEU is notoriously unreliable for very short texts. With so little text to begin with, every mismatched n-gram swings the precision disproportionately, and on top of that the brevity penalty punishes candidates that are shorter than the reference; it exists to stop a model from gaming precision by outputting fewer words. Please take a look at the "Brevity Penalty" section in the following article, and the sanity check after the sketch below.
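To make point 1 concrete, here is a minimal sketch using your two strings. It compares plain str.split() with NLTK's word_tokenize, which splits the contraction "I'm" into "I" and "'m" (and peels off the final period), so the two sentences share more n-grams; the exact score will depend on your NLTK version, and word_tokenize needs the punkt models downloaded:

import nltk
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# nltk.download('punkt')  # word_tokenize needs the punkt tokenizer models

prediction = "I am ABC."
reference = "I'm ABC."

print(prediction.split())  # ['I', 'am', 'ABC.']
print(reference.split())   # ["I'm", 'ABC.'] -- only 'ABC.' overlaps

# word_tokenize splits the contraction and the punctuation,
# so the sentences now share the unigrams 'I', 'ABC', and '.':
prediction_tokens = word_tokenize(prediction)  # ['I', 'am', 'ABC', '.']
reference_tokens = word_tokenize(reference)    # ['I', "'m", 'ABC', '.']

bleu_score = sentence_bleu([reference_tokens], prediction_tokens,
                           smoothing_function=SmoothingFunction().method4)
print(f"BLEU with word_tokenize: {bleu_score:.4f}")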
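And for point 2, the brevity penalty is easy to check by hand from its standard definition (Papineni et al., 2002): BP = 1 if the candidate is longer than the reference, else exp(1 - r/c), where c and r are the candidate and reference lengths. The helper below is just an illustration of that formula, not NLTK's internal implementation; note that for your particular pair the split() candidate (3 tokens) is longer than the reference (2 tokens), so BP = 1 and the low score comes entirely from the n-gram precisions:

import math

def brevity_penalty(c, r):
    """Standard BLEU brevity penalty: 1 if c > r, else exp(1 - r/c)."""
    if c > r:
        return 1.0
    return math.exp(1 - r / c)

# With split() tokenization, c = 3 and r = 2, so no penalty applies here:
print(brevity_penalty(3, 2))  # 1.0
print(brevity_penalty(2, 3))  # ~0.6065 -- a too-short candidate is penalized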

Hope you found the answer you were looking for 🤓.