I am using a Stanza pipeline that extracts both words and named entities.
The sentence.entities gives me a list of recognized named entities with their start and end characters. Here is an example:
{
"text": "Dante Alighieri",
"type": "PER",
"start_char": 1,
"end_char": 16
}
The sentence.words gives a list of all tokenized words, also with their start and end characters. Here is a fragment of the corresponding example:
{
"id": 1,
"text": "Dante",
"lemma": "Dante",
"upos": "PROPN",
"xpos": "SP",
"head": 3,
"deprel": "nsubj",
"start_char": 1,
"end_char": 6
}
{
"id": 2,
"text": "Alighieri",
"lemma": "Alighieri",
"upos": "PROPN",
"xpos": "SP",
"head": 1,
"deprel": "flat:name",
"start_char": 7,
"end_char": 16
}
{
"id": 3,
"text": "scrisse",
"lemma": "scrivere",
"upos": "VERB",
"xpos": "V",
"feats": "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin",
"head": 0,
"deprel": "root",
"start_char": 17,
"end_char": 24
}
I need to generate a list of all words that are included in the named entity span. Using the above example, those would be the words with id 1 and 2, but not 3.
You can build mappings from the first and last characters to the word ids. For instance, here is a quick function I wrote to extract the text of the NER words in a sentence:
def find_entity_words(ent, sent):
    # Index the sentence's words by id, start offset, and end offset
    id_tree = {word.id: word for word in sent.words}
    start_tree = {word.start_char: word for word in sent.words}
    end_tree = {word.end_char: word for word in sent.words}
    # The entity's boundaries identify its first and last word
    firstword = start_tree[ent.start_char]
    lastword = end_tree[ent.end_char]
    # Every word id in between belongs to the entity
    word_ids = range(firstword.id, lastword.id + 1)
    words = [id_tree[i] for i in word_ids]
    return [word.text for word in words]
With an example:
>>> doc = nlp_model('Jean-Claude Van Damme est un acteur.')
>>> doc.ents
[{
"text": "Jean-Claude Van Damme",
"type": "PER",
"start_char": 0,
"end_char": 21
}]
>>> find_entity_words(doc.ents[0],doc.sentences[0])
['Jean-Claude', 'Van', 'Damme']
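An alternative, in case the exact-offset lookups ever fail (e.g. if an entity boundary does not coincide exactly with a token boundary), is to filter words by span containment. Here is a minimal sketch; it is not Stanza-specific and only assumes objects carrying text, start_char, and end_char attributes, so a small dataclass stands in for Stanza's word and entity objects:

```python
from dataclasses import dataclass

@dataclass
class Span:
    """Stand-in for a Stanza word or entity: text plus character offsets."""
    text: str
    start_char: int
    end_char: int

def words_in_entity(ent, words):
    # Keep every word whose character range lies inside the entity's range.
    # No exact-offset lookup is needed, so slight boundary mismatches
    # between tokens and entities are tolerated.
    return [w.text for w in words
            if w.start_char >= ent.start_char and w.end_char <= ent.end_char]

# Hypothetical data mirroring the example from the question:
words = [Span("Dante", 1, 6), Span("Alighieri", 7, 16), Span("scrisse", 17, 24)]
ent = Span("Dante Alighieri", 1, 16)
print(words_in_entity(ent, words))  # ['Dante', 'Alighieri']
```

With real Stanza objects you would pass doc.ents[0] and doc.sentences[0].words directly, since they expose the same attributes.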