I am using a Stanza pipeline that extracts both words and named entities.
The sentence.entities gives me a list of recognized named entities with their start and end characters. Here is an example:
{
"text": "Dante Alighieri",
"type": "PER",
"start_char": 1,
"end_char": 16
}
The sentence.words gives a list of all tokenized words, also with their start and end characters. Here is a fragment of the corresponding example:
{
"id": 1,
"text": "Dante",
"lemma": "Dante",
"upos": "PROPN",
"xpos": "SP",
"head": 3,
"deprel": "nsubj",
"start_char": 1,
"end_char": 6
}
{
"id": 2,
"text": "Alighieri",
"lemma": "Alighieri",
"upos": "PROPN",
"xpos": "SP",
"head": 1,
"deprel": "flat:name",
"start_char": 7,
"end_char": 16
}
{
"id": 3,
"text": "scrisse",
"lemma": "scrivere",
"upos": "VERB",
"xpos": "V",
"feats": "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin",
"head": 0,
"deprel": "root",
"start_char": 17,
"end_char": 24
}
I need to generate a list of all words that are included in the named entity span. Using the above example, those would be the words with id 1 and 2, but not 3.
You can build mappings from the first and last characters to the word ids. For instance, here is a quick function I wrote to extract the text of the NER words in a sentence:
def find_entity_words(ent, sent):
    # Index the sentence's words by id, start offset, and end offset
    id_tree = {word.id: word for word in sent.words}
    start_tree = {word.start_char: word for word in sent.words}
    end_tree = {word.end_char: word for word in sent.words}
    # The entity's boundaries identify its first and last word
    firstword = start_tree[ent.start_char]
    lastword = end_tree[ent.end_char]
    # Every word id in between belongs to the entity
    word_ids = range(firstword.id, lastword.id + 1)
    words = [id_tree[i] for i in word_ids]
    return [word.text for word in words]
With an example:
>>> doc = nlp_model('Jean-Claude Van Damme est un acteur.')
>>> doc.ents
[{
"text": "Jean-Claude Van Damme",
"type": "PER",
"start_char": 0,
"end_char": 21
}]
>>> find_entity_words(doc.ents[0],doc.sentences[0])
['Jean-Claude', 'Van', 'Damme']
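An alternative, in case the exact-offset lookups ever fail (e.g. if an entity boundary does not coincide exactly with a token boundary), is to filter words by span containment. Here is a minimal sketch; it is not Stanza-specific and only assumes objects carrying text, start_char, and end_char attributes, so a small dataclass stands in for Stanza's word and entity objects:

```python
from dataclasses import dataclass

@dataclass
class Span:
    """Stand-in for a Stanza word or entity: text plus character offsets."""
    text: str
    start_char: int
    end_char: int

def words_in_entity(ent, words):
    # Keep every word whose character range lies inside the entity's range.
    # No exact-offset lookup is needed, so slight boundary mismatches
    # between tokens and entities are tolerated.
    return [w.text for w in words
            if w.start_char >= ent.start_char and w.end_char <= ent.end_char]

# Hypothetical data mirroring the example from the question:
words = [Span("Dante", 1, 6), Span("Alighieri", 7, 16), Span("scrisse", 17, 24)]
ent = Span("Dante Alighieri", 1, 16)
print(words_in_entity(ent, words))  # ['Dante', 'Alighieri']
```

With real Stanza objects you would pass doc.ents[0] and doc.sentences[0].words directly, since they expose the same attributes.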