Tags: python, meteor, huggingface-evaluate

Evaluate's METEOR Implementation returns 0 score


I have the following code:

import evaluate

reference1 = "犯人受到了嚴密的監控。" # Ground Truth
hypothesis1 = "犯人受到嚴密監視。" # Translated Sentence

metric_meteor = evaluate.load('meteor')
meteor = metric_meteor.compute(predictions=[hypothesis1], references=[reference1])
print("METEOR:", meteor["meteor"])

It returns 0.0.

My question: how can I make the above code produce the same score as the NLTK code below?

With NLTK, the score is 98.14814814814815:

from nltk.translate.meteor_score import single_meteor_score
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('fnlp/bart-base-chinese')

tokenized_reference1 = tokenizer(reference1)
tokenized_hypothesis1 = tokenizer(hypothesis1)

print("METEOR:", single_meteor_score(tokenized_reference1, tokenized_hypothesis1) * 100)

Looking at Evaluate's METEOR implementation, it's actually a wrapper around NLTK: https://huggingface.co/spaces/evaluate-metric/meteor/blob/main/meteor.py
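
If I read meteor.py correctly, for recent NLTK versions (which expect pre-tokenized input) the compute step boils down to roughly the following; this is a paraphrase of the linked file, not the exact source, and compute_meteor is just a name I'm using here:

import numpy as np
from nltk import word_tokenize
from nltk.translate import meteor_score

def compute_meteor(predictions, references, alpha=0.9, beta=3.0, gamma=0.5):
    # Both sides are tokenized with nltk.word_tokenize before scoring.
    scores = [
        meteor_score.single_meteor_score(
            word_tokenize(ref), word_tokenize(pred),
            alpha=alpha, beta=beta, gamma=gamma,
        )
        for ref, pred in zip(references, predictions)
    ]
    return {"meteor": np.mean(scores)}

So both the reference and the prediction are run through nltk.word_tokenize before the NLTK scorer ever sees them.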


Solution

  • The problem is that meteor.py uses word_tokenize as its tokenizer, and word_tokenize does not segment Chinese text into words, so the reference and the hypothesis end up with no matching tokens and the score is 0. There doesn't seem to be a way to pass your own tokenizer as an argument (you could file a feature request so that the author adds it). You can, however, patch the tokenizer while metric_meteor is created (a cross-check against plain NLTK is sketched after the output below):

    import evaluate
    from unittest.mock import patch
    import nltk
    from transformers import AutoTokenizer
    
    reference1 = "犯人受到了嚴密的監控。" # Ground Truth
    hypothesis1 = "犯人受到嚴密監視。" # Translated Sentence
    
    tokenizer = AutoTokenizer.from_pretrained('fnlp/bart-base-chinese')
    
    with patch.object(nltk, 'word_tokenize', tokenizer):
        metric_meteor = evaluate.load('meteor')
    
    meteor = metric_meteor.compute(predictions=[hypothesis1], references=[reference1], alpha=0.9, beta=3.0, gamma=0.5)
    print("METEOR:", meteor["meteor"])
    

    Output:

    METEOR: 0.9814814814814815
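
    Note that alpha=0.9, beta=3.0 and gamma=0.5 only restate NLTK's defaults, so passing them should be optional. As a sanity check, the NLTK call from the question, fed the same tokenizer output, should print the same number as the patched metric:

    # Cross-check: the question's NLTK call with the same tokenizer should
    # reproduce the patched metric's score (0.9814814814814815 per the
    # numbers reported in the question).
    from nltk.translate.meteor_score import single_meteor_score

    nltk_score = single_meteor_score(
        tokenizer(reference1), tokenizer(hypothesis1),
        alpha=0.9, beta=3.0, gamma=0.5,
    )
    print("NLTK METEOR:", nltk_score)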