I am using PyTorch and a pre-trained model.
Here is my code:
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline


class NER(object):
    def __init__(self, model_name_or_path, tokenizer_name_or_path):
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)
        self.model = AutoModelForTokenClassification.from_pretrained(
            model_name_or_path)
        # Token-level NER pipeline: returns one dict per tagged token
        self.nlp = pipeline("ner", model=self.model, tokenizer=self.tokenizer)

    def get_mention_entities(self, query):
        return self.nlp(query)
When I call get_mention_entities
and print its output for "اینجا دانشگاه صنعتی امیرکبیر است."
it gives:
[{'entity': 'B-FAC', 'score': 0.9454591, 'index': 2, 'word': 'دانشگاه', 'start': 6, 'end': 13}, {'entity': 'I-FAC', 'score': 0.9713519, 'index': 3, 'word': 'صنعتی', 'start': 14, 'end': 19}, {'entity': 'I-FAC', 'score': 0.9860724, 'index': 4, 'word': 'امیرکبیر', 'start': 20, 'end': 28}]
As you can see, it can recognize the university name, but there are three tokens in the list.
Is there any standard way to combine these tokens based on the "entity" attribute?
The desired output is something like:
[{'entity': 'FAC', 'word': 'دانشگاه صنعتی امیرکبیر', 'start': 6, 'end': 28}]
I could write a function to iterate over the tokens, compare them, and merge them based on the "entity" attribute, but I would prefer a standard way, such as a built-in PyTorch function or something similar.
My question is similar to this question.
PS: "دانشگاه صنعتی امیرکبیر" is a university name.
Hugging Face's NER pipeline accepts the argument grouped_entities=True,
which does exactly what you are looking for: it groups the B- and I- tagged tokens into unified entities.
Changing the pipeline construction to
self.nlp = pipeline("ner", model=self.model, tokenizer=self.tokenizer, grouped_entities=True)
should do the trick.
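
Two side notes. As far as I know, newer transformers releases deprecate grouped_entities in favor of aggregation_strategy (for example aggregation_strategy="simple"), and the grouped results report the label under the key entity_group rather than entity. And if you ever need the merge outside the pipeline, say on token dicts you have already stored, a hand-written version is short. The following is only a minimal sketch, not the library's implementation: merge_bio_entities is a hypothetical helper name, and it assumes each token dict carries the entity, index, start, and end keys shown in the question's output.

```python
def merge_bio_entities(tokens, text):
    """Merge consecutive B-/I- tagged tokens into single entity spans.

    Hypothetical helper, not part of transformers. Assumes each token dict
    has 'entity', 'index', 'start' and 'end' keys, as in the question.
    """
    spans = []
    current = None
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        if not label:
            # Tag without a label (e.g. "O"): close any open span and skip.
            if current is not None:
                spans.append(current)
                current = None
            continue
        # Extend the open span only for an I- tag with the same label
        # on the token that directly follows the previous one.
        continues = (
            current is not None
            and prefix == "I"
            and label == current["entity"]
            and tok["index"] == current["last_index"] + 1
        )
        if continues:
            current["end"] = tok["end"]
            current["last_index"] = tok["index"]
        else:
            if current is not None:
                spans.append(current)
            current = {
                "entity": label,
                "start": tok["start"],
                "end": tok["end"],
                "last_index": tok["index"],
            }
    if current is not None:
        spans.append(current)
    # Recover the surface form from the original text via character offsets.
    return [
        {
            "entity": s["entity"],
            "word": text[s["start"]:s["end"]],
            "start": s["start"],
            "end": s["end"],
        }
        for s in spans
    ]


# Token dicts copied from the question's output
text = "اینجا دانشگاه صنعتی امیرکبیر است."
tokens = [
    {"entity": "B-FAC", "score": 0.9454591, "index": 2, "word": "دانشگاه", "start": 6, "end": 13},
    {"entity": "I-FAC", "score": 0.9713519, "index": 3, "word": "صنعتی", "start": 14, "end": 19},
    {"entity": "I-FAC", "score": 0.9860724, "index": 4, "word": "امیرکبیر", "start": 20, "end": 28},
]
result = merge_bio_entities(tokens, text)
print(result)
```

On the question's example this yields a single FAC entity spanning characters 6 to 28. Slicing the original text by offsets, rather than joining the word fields, keeps the original spacing and avoids subword pieces leaking into the result. The sketch drops the per-token scores; how to combine them (mean, max, first) is a design choice left open here.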