Tags: nlp, mask, huggingface

How to do NLP fill-mask with restricted possible inputs


I want to use NLP to fill in a masked word in a text, but instead of choosing from all possible words I want to find which is the more likely of two candidate words. For example, imagine I have a sentence "The [MASK] was stuck in the tree" and I want to evaluate whether "kite" or "bike" is the more likely word.

I know how to find the globally most probable words using Hugging Face's fill-mask pipeline:

from transformers import pipeline, AutoTokenizer, AutoModelForMaskedLM

# Define the input sentence with a masked word
input_text = "The [MASK] was stuck in the tree"

# Load the pre-trained model and tokenizer
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)


# Use the pipeline to generate a list of predicted words and probabilities
mlm = pipeline("fill-mask", model=model, tokenizer=tokenizer)
results = mlm(input_text)

# Print outputs
for result in results:
    token = result["token_str"]
    print(f"{token:<15} {result['score']}")

However, if "bike" and "kite" aren't among the first few most probable words, this doesn't help.

How can I use fill-mask to find the probabilities of specific candidate words?

P.S. I'm not sure if Stack Overflow is the best place to post this question; there doesn't seem to be a site for NLP-specific questions.


Solution

  • Since version 3.1.0, the fill-mask pipeline supports a targets argument that does this directly:

    from transformers import pipeline
    
    pipe = pipeline("fill-mask", model="bert-base-cased")
    results = pipe("The [MASK] was stuck in the tree", targets=["bike", "kite"])
    
    for result in results:
        print(f'{result["token_str"]}: {result["score"]}')
    

    Note that if any of the target words are not in the model's vocabulary, the pipeline instead returns the score of the word's first subword token.
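
    If you want more control (for example, to detect the subword caveat yourself), here is a manual sketch that reads the candidate probabilities straight from the model's logits rather than going through the pipeline. The variable names (`candidates`, `mask_index`) are illustrative, not part of the original answer:

    ```python
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    model_name = "bert-base-cased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)

    input_text = "The [MASK] was stuck in the tree"
    candidates = ["bike", "kite"]

    # Encode the sentence and locate the [MASK] position
    inputs = tokenizer(input_text, return_tensors="pt")
    mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

    with torch.no_grad():
        logits = model(**inputs).logits

    # Softmax over the full vocabulary at the masked position
    probs = logits[0, mask_index, :].softmax(dim=-1).squeeze(0)

    for word in candidates:
        # convert_tokens_to_ids maps a word to its vocabulary id;
        # words not in the vocabulary map to the unknown-token id,
        # which you can check for explicitly
        token_id = tokenizer.convert_tokens_to_ids(word)
        if token_id == tokenizer.unk_token_id:
            print(f"{word}: not a single token in the vocabulary")
        else:
            print(f"{word}: {probs[token_id].item():.6f}")
    ```

    Because you index the probability distribution yourself, the two candidate scores are directly comparable, and you can decide how to handle out-of-vocabulary words instead of silently falling back to a first subword.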