I'm trying to calculate an n-gram BLEU score using Python. The weights I used for uni-gram, bi-gram, tri-gram, and 4-gram are (0.25, 0.25, 0, 0).
When I run the script for the first reference, it gives me a BLEU score of 0.51.
The script is:
from nltk.translate.bleu_score import sentence_bleu  # assuming NLTK's sentence_bleu
# Define the desired weights (here: equal weight on uni-grams and bi-grams only)
weights = (0.25, 0.25, 0, 0)  # Weights for uni-gram, bi-gram, tri-gram, and 4-gram
# Reference and predicted texts
reference = [["the", "alleyway", "barely", "lives", "in", "semi", "isolation"]]
predictions = ["midaq", "alley", "lives", "in", "almost", "complete", "isolation"]
# Calculate BLEU score with weights
score = sentence_bleu(reference, predictions, weights=weights)
print(score)
But when I run the same script for the second reference, it gives a BLEU score of 6.91.
The script is:
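from nltk.translate.bleu_score import sentence_bleu  # assuming NLTK's sentence_bleu
# Define the desired weights (here: equal weight on uni-grams and bi-grams only)
weights = (0.25, 0.25, 0, 0)  # Weights for uni-gram, bi-gram, tri-gram, and 4-gram
# Reference and predicted texts (same prediction as before, different reference)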
reference = [["the", "alley", "is", "almost", "living", "in", "a", "state", "of", "isolation"]]
predictions = ["midaq", "alley", "lives", "in", "almost", "complete", "isolation"]
# Calculate BLEU score with weights
score = sentence_bleu(reference, predictions, weights=weights)
print(score)
Why is there such a big difference even though the weights and the code are the same? How do I determine the weights? Are there any specific criteria?
As mentioned here:
Only big differences in metric scores are meaningful in MT
- If System A has a BLEU score that is 1-2 points higher than System B (common in academic papers), then there is only a 50% chance that human evaluators will prefer System A over System B.
- If System A has a BLEU score that is 3-5 points higher than System B, there is a 75% chance that human evaluators will prefer A over B.
- In order to get a 95% chance that human evaluators will prefer A over B, we need something like a 10-point improvement in BLEU (they don't state this; I am guessing it by eyeballing their graphs).
So a difference of 6.4 is acceptable.
Your two inputs are quite different, and each one is already very small (a single short sentence). So of course, the scores are different.
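To see how much the two inputs differ, it can help to count the n-gram overlaps by hand. The following is a minimal sketch in plain Python, not NLTK's implementation: the helper names (`ngrams`, `modified_precision`, `brevity_penalty`, `weighted_bleu`) are made up for illustration, it handles a single reference only, and it applies no smoothing, so it simply returns 0.0 when a weighted n-gram order has no matches.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision against one reference, as (matches, total)."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    # A hypothesis n-gram is credited at most as often as it occurs in the reference.
    matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matches, sum(hyp_counts.values())

def brevity_penalty(reference, hypothesis):
    """Penalty applied when the hypothesis is shorter than the reference."""
    r, c = len(reference), len(hypothesis)
    return 1.0 if c > r else math.exp(1 - r / c)

def weighted_bleu(reference, hypothesis, weights):
    """BP * exp(sum of w_n * log p_n), unsmoothed (illustrative assumption)."""
    log_sum = 0.0
    for n, w in enumerate(weights, start=1):
        if w == 0:
            continue  # orders with zero weight do not contribute
        matches, total = modified_precision(reference, hypothesis, n)
        if matches == 0:
            return 0.0  # no overlap at a weighted order: geometric mean collapses
        log_sum += w * math.log(matches / total)
    return brevity_penalty(reference, hypothesis) * math.exp(log_sum)

prediction = ["midaq", "alley", "lives", "in", "almost", "complete", "isolation"]
reference_1 = ["the", "alleyway", "barely", "lives", "in", "semi", "isolation"]
reference_2 = ["the", "alley", "is", "almost", "living", "in", "a", "state", "of", "isolation"]
weights = (0.25, 0.25, 0, 0)

for name, ref in (("reference 1", reference_1), ("reference 2", reference_2)):
    uni = modified_precision(ref, prediction, 1)
    bi = modified_precision(ref, prediction, 2)
    bp = round(brevity_penalty(ref, prediction), 3)
    score = round(weighted_bleu(ref, prediction, weights), 3)
    print(name, uni, bi, bp, score)
# reference 1 (3, 7) (1, 6) 1.0 0.517
# reference 2 (4, 7) (0, 6) 0.651 0.0
```

Under these assumptions, the first reference shares one bi-gram ("lives in") with the prediction, while the second shares none and is also longer than the prediction, so the brevity penalty kicks in. With no bi-gram match, an unsmoothed score is essentially zero, so it may be worth checking whether the second output was actually printed in scientific notation (a tiny number with a large negative exponent) rather than a plain 6.91. As far as I know, NLTK also warns about zero n-gram overlaps in this situation and points to its `SmoothingFunction` as one option.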