Tags: python, meteor, huggingface-evaluate

Evaluate's METEOR Implementation returns 0 score


I have the following code:

import evaluate

reference1 = "犯人受到了嚴密的監控。" # Ground Truth
hypothesis1 = "犯人受到嚴密監視。" # Translated Sentence

metric_meteor = evaluate.load('meteor')
meteor = metric_meteor.compute(predictions=[hypothesis1], references=[reference1])
print("METEOR:", meteor["meteor"])

It returns 0.0.

My question: how can I make the above code produce the same score as the NLTK code below?

With NLTK, the score is 98.14814814814815:

from nltk.translate.meteor_score import single_meteor_score
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('fnlp/bart-base-chinese')

tokenized_reference1 = tokenizer(reference1)
tokenized_hypothesis1 = tokenizer(hypothesis1)

print("METEOR:", single_meteor_score(tokenized_reference1, tokenized_hypothesis1) * 100)

Looking at Evaluate's METEOR implementation, it's actually a wrapper around NLTK: https://huggingface.co/spaces/evaluate-metric/meteor/blob/main/meteor.py
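
If I read meteor.py correctly, for recent NLTK versions (which expect pre-tokenized input) the compute step boils down to roughly the following; this is a paraphrase of the linked file, not the exact source, and compute_meteor is just a name I'm using here:

import numpy as np
from nltk import word_tokenize
from nltk.translate import meteor_score

def compute_meteor(predictions, references, alpha=0.9, beta=3.0, gamma=0.5):
    # Both sides are tokenized with nltk.word_tokenize before scoring.
    scores = [
        meteor_score.single_meteor_score(
            word_tokenize(ref), word_tokenize(pred),
            alpha=alpha, beta=beta, gamma=gamma,
        )
        for ref, pred in zip(references, predictions)
    ]
    return {"meteor": np.mean(scores)}

So both the reference and the prediction are run through nltk.word_tokenize before the NLTK scorer ever sees them.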


Solution

  • The problem is that meteor.py uses word_tokenize as its tokenizer, and word_tokenize does not segment Chinese text into words, so the reference and the hypothesis end up with no matching tokens and the score is 0. There doesn't seem to be a way to pass your own tokenizer as an argument (you could file a feature request so that the author adds it). You can, however, patch the tokenizer while metric_meteor is created (a cross-check against plain NLTK is sketched after the output below):

    import evaluate
    from unittest.mock import patch
    import nltk
    from transformers import AutoTokenizer
    
    reference1 = "犯人受到了嚴密的監控。" # Ground Truth
    hypothesis1 = "犯人受到嚴密監視。" # Translated Sentence
    
    tokenizer = AutoTokenizer.from_pretrained('fnlp/bart-base-chinese')
    
    with patch.object(nltk, 'word_tokenize', tokenizer):
        metric_meteor = evaluate.load('meteor')
    
    meteor = metric_meteor.compute(predictions=[hypothesis1], references=[reference1], alpha=0.9, beta=3.0, gamma=0.5)
    print("METEOR:", meteor["meteor"])
    

    Output:

    METEOR: 0.9814814814814815
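
    Note that alpha=0.9, beta=3.0 and gamma=0.5 only restate NLTK's defaults, so passing them should be optional. As a sanity check, the NLTK call from the question, fed the same tokenizer output, should print the same number as the patched metric:

    # Cross-check: the question's NLTK call with the same tokenizer should
    # reproduce the patched metric's score (0.9814814814814815 per the
    # numbers reported in the question).
    from nltk.translate.meteor_score import single_meteor_score

    nltk_score = single_meteor_score(
        tokenizer(reference1), tokenizer(hypothesis1),
        alpha=0.9, beta=3.0, gamma=0.5,
    )
    print("NLTK METEOR:", nltk_score)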