python, nlp, stanford-nlp

Stanford's Stanza NLP: find all word ids for a given span


I am using a Stanza pipeline that extracts both words and named entities.

sentence.entities gives me a list of recognized named entities with their start and end characters. Here is an example:

{
  "text": "Dante Alighieri",
  "type": "PER",
  "start_char": 1,
  "end_char": 16
}

sentence.words gives a list of all tokenized words, also with their start and end characters. Here is a fragment of the corresponding example:

{
  "id": 1,
  "text": "Dante",
  "lemma": "Dante",
  "upos": "PROPN",
  "xpos": "SP",
  "head": 3,
  "deprel": "nsubj",
  "start_char": 1,
  "end_char": 6
}
{
  "id": 2,
  "text": "Alighieri",
  "lemma": "Alighieri",
  "upos": "PROPN",
  "xpos": "SP",
  "head": 1,
  "deprel": "flat:name",
  "start_char": 7,
  "end_char": 16
}
{
  "id": 3,
  "text": "scrisse",
  "lemma": "scrivere",
  "upos": "VERB",
  "xpos": "V",
  "feats": "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin",
  "head": 0,
  "deprel": "root",
  "start_char": 17,
  "end_char": 24
}

I need to generate a list of all words that are included in the named entity span. In the example above, those would be the words with ids 1 and 2, but not 3.


Solution

  • You can build mappings from start and end characters to words to recover the word ids. For instance, here is a quick function I wrote to extract the texts of an entity's words in a sentence:

    def find_entity_words(ent, sent):
        # Index the sentence's words by id, start offset, and end offset
        id_tree = {word.id: word for word in sent.words}
        start_tree = {word.start_char: word for word in sent.words}
        end_tree = {word.end_char: word for word in sent.words}

        # The entity's boundaries coincide with the start of its first
        # word and the end of its last word
        firstword = start_tree[ent.start_char]
        lastword = end_tree[ent.end_char]

        # Word ids are contiguous within a sentence, so every id in
        # between belongs to the entity
        word_ids = range(firstword.id, lastword.id + 1)
        words = [id_tree[i] for i in word_ids]

        return [word.text for word in words]
    

    With an example:

    >>> doc = nlp_model('Jean-Claude Van Damme est un acteur.')
    >>> doc.ents
    [{
       "text": "Jean-Claude Van Damme",
       "type": "PER",
       "start_char": 0,
       "end_char": 21
     }]
    
    >>> find_entity_words(doc.ents[0],doc.sentences[0])
    ['Jean-Claude', 'Van', 'Damme']
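As a variant of the same idea (my own sketch, not part of the original answer): instead of looking up the exact start and end offsets, you can keep every word whose character span falls inside the entity span. The `SimpleNamespace` objects below are stand-ins for Stanza's word and entity objects, used only so the example runs without a model; with Stanza you would pass `doc.ents[0]` and `doc.sentences[0]` directly.

```python
from types import SimpleNamespace

def words_in_span(ent, sent):
    # Keep every word fully contained in the entity's character span
    return [w.text for w in sent.words
            if w.start_char >= ent.start_char and w.end_char <= ent.end_char]

# Stand-ins mimicking the attributes from the question's example
words = [
    SimpleNamespace(id=1, text="Dante", start_char=1, end_char=6),
    SimpleNamespace(id=2, text="Alighieri", start_char=7, end_char=16),
    SimpleNamespace(id=3, text="scrisse", start_char=17, end_char=24),
]
sent = SimpleNamespace(words=words)
ent = SimpleNamespace(start_char=1, end_char=16)

print(words_in_span(ent, sent))  # ['Dante', 'Alighieri']
```

This avoids a `KeyError` when an entity boundary does not line up exactly with a token boundary, at the cost of a linear scan over the sentence's words.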