[SOLVED] How to get mentions in PyTorch NER instead of tokens?

I am using PyTorch and a pre-trained model, loaded through the Hugging Face transformers pipeline.

Here is my code:

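from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline


class NER(object):
    def __init__(self, model_name_or_path, tokenizer_name_or_path):
        # Load the tokenizer and the token-classification model.
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)
        self.model = AutoModelForTokenClassification.from_pretrained(
            model_name_or_path)
        # Wrap both in an NER pipeline.
        self.nlp = pipeline("ner", model=self.model, tokenizer=self.tokenizer)

    def get_mention_entities(self, query):
        # Returns one dict per tagged token.
        return self.nlp(query)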

When I call get_mention_entities on "اینجا دانشگاه صنعتی امیرکبیر است." and print the output, I get:

[{'entity': 'B-FAC', 'score': 0.9454591, 'index': 2, 'word': 'دانشگاه', 'start': 6, 'end': 13}, {'entity': 'I-FAC', 'score': 0.9713519, 'index': 3, 'word': 'صنعتی', 'start': 14, 'end': 19}, {'entity': 'I-FAC', 'score': 0.9860724, 'index': 4, 'word': 'امیرکبیر', 'start': 20, 'end': 28}]

As you can see, the model recognizes the university name, but it returns the name as three separate tokens.

Is there any standard way to combine these tokens based on the "entity" attribute?

The desired output is something like:

[{'entity': 'FAC', 'word': 'دانشگاه صنعتی امیرکبیر', 'start': 6, 'end': 28}]

I could write a function that iterates over the tokens, compares them, and merges them based on the "entity" attribute, but I would prefer a standard way, such as a built-in function of PyTorch or the library.

My question is similar to this question.

PS: "دانشگاه صنعتی امیرکبیر" is a university name.

Solution

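Hugging Face's NER pipeline accepts the argument grouped_entities=True, which does exactly what you are looking for: it groups the B- and I- tagged tokens into unified entities. Note that in recent transformers releases this argument is deprecated in favor of aggregation_strategy="simple".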

Changing the pipeline construction to

self.nlp = pipeline("ner", model=self.model, tokenizer=self.tokenizer, grouped_entities=True)

should do the trick.
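
To illustrate what this grouping amounts to, here is a minimal pure-Python sketch of merging B-/I- tagged tokens by hand. It is not the library's implementation. The function name merge_bio_tokens is hypothetical, and the sketch assumes the token dicts have the shape shown in the question, arrive in text order, and contain no "O" tokens.

```python
def merge_bio_tokens(tokens, text):
    """Merge consecutive B-/I- tagged tokens into entity mentions.

    Hypothetical helper for illustration only; assumes `tokens` are the
    per-token dicts from the question, in text order, without "O" tags.
    """
    mentions = []
    current = None
    for tok in tokens:
        # Split "B-FAC" into the prefix "B" and the label "FAC".
        prefix, _, label = tok["entity"].partition("-")
        continues = (
            current is not None
            and prefix == "I"
            and label == current["entity"]
        )
        if continues:
            # Extend the open mention to cover this token.
            current["end"] = tok["end"]
        else:
            # Close the open mention (if any) and start a new one.
            if current is not None:
                mentions.append(current)
            current = {
                "entity": label or prefix,
                "start": tok["start"],
                "end": tok["end"],
            }
    if current is not None:
        mentions.append(current)
    # Recover the surface form from the character offsets.
    for mention in mentions:
        mention["word"] = text[mention["start"]:mention["end"]]
    return mentions


# Token dicts copied from the question (scores and indices omitted).
text = "اینجا دانشگاه صنعتی امیرکبیر است."
tokens = [
    {"entity": "B-FAC", "word": "دانشگاه", "start": 6, "end": 13},
    {"entity": "I-FAC", "word": "صنعتی", "start": 14, "end": 19},
    {"entity": "I-FAC", "word": "امیرکبیر", "start": 20, "end": 28},
]
print(merge_bio_tokens(tokens, text))
```

Slicing the original text by the merged offsets keeps the spacing between the tokens, which joining the individual "word" fields would not do reliably for sub-word tokenizers.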