I want to use NLP to fill in a masked word in a text, but instead of choosing from all possible words I want to find which is the more likely of two candidate words. For example, imagine I have a sentence "The [MASK] was stuck in the tree" and I want to evaluate whether "kite" or "bike" is the more likely word.
I know how to find the globally most probable words using Hugging Face's fill-mask pipeline:
from transformers import pipeline, AutoTokenizer, AutoModelForMaskedLM
# Define the input sentence with a masked word
input_text = "The [MASK] was stuck in the tree"
# Load the pre-trained model and tokenizer
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
# Use the pipeline to generate a list of predicted words and probabilities
mlm = pipeline("fill-mask", model=model, tokenizer=tokenizer)
results = mlm(input_text)
# Print outputs
for result in results:
    token = result["token_str"]
    print(f"{token:<15} {result['score']}")
However, if "bike" and "kite" aren't among the first few most probable words, this doesn't help.
How can I use fill-mask to find the probabilities of specific candidate words for a mask?
P.S. I'm not sure if Stack Overflow is the best place to post this question; there doesn't seem to be a site for NLP-specific questions.
Since version 3.1.0, the fill-mask pipeline also supports the targets argument to do this directly in the pipeline:
from transformers import pipeline
pipe = pipeline("fill-mask", model="bert-base-cased")
results = pipe("The [MASK] was stuck in the tree", targets=["bike", "kite"])
for result in results:
    print(f'{result["token_str"]}: {result["score"]}')
Note that if any of the target words is not in the model's vocabulary, the pipeline will instead return the score for its first subword token.
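If you prefer not to rely on the pipeline, a minimal sketch of the same idea is to run the model yourself, take the logits at the mask position, and compare the softmax probabilities of the candidate token ids. This assumes both candidates are single tokens in bert-base-cased's vocabulary; a multi-token word would need its subwords scored separately.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

inputs = tokenizer("The [MASK] was stuck in the tree", return_tensors="pt")

# Locate the [MASK] position in the tokenized input
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits

# Probability distribution over the whole vocabulary at the mask position
probs = logits[0, mask_index].softmax(dim=-1)

for word in ["kite", "bike"]:
    # Assumes the word maps to a single vocabulary token
    token_id = tokenizer.convert_tokens_to_ids(word)
    print(f"{word}: {probs[token_id].item():.6f}")
```

These are the same scores the pipeline reports for its targets, so you can compare any two candidates even when neither appears in the top-k predictions.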