
Extract Noun Phrases with Stanza and CoreNLPClient


I am trying to extract noun phrases from sentences using Stanza (with Stanford CoreNLP). This can only be done with the CoreNLPClient module in Stanza.

# Import client module
from stanza.server import CoreNLPClient
# Construct a CoreNLPClient with some basic annotators, a memory allocation of 4GB, and port number 9001
client = CoreNLPClient(annotators=['tokenize','ssplit','pos','lemma','ner', 'parse'], memory='4G', endpoint='http://localhost:9001')

Here is an example sentence. I am using the client's tregex function to get all the noun phrases. tregex returns a nested Python dict: a 'sentences' list with one dict per sentence, whose values are themselves dicts describing each match. So I had to process the tregex output before passing it to Tree.fromstring in NLTK to extract the noun phrases as strings.

pattern = 'NP'
text = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
matches = client.tregex(text, pattern)
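
For orientation, the nesting of that result can be walked as in the sketch below. It runs without a CoreNLP server because sample_matches is a hand-written, abbreviated stand-in for the response (an assumption for illustration only; a real response may carry more fields and multi-line trees), and iter_matches is a hypothetical helper name.

```python
# Hand-written stand-in for the dict returned by client.tregex(text, 'NP').
# Only the 'match' and 'spanString' fields are mocked here (assumption).
sample_matches = {
    "sentences": [
        {
            "0": {"match": "(NP (NNP Albert) (NNP Einstein))\n",
                  "spanString": "Albert Einstein"},
            "1": {"match": "(NP (DT a) (JJ German-born) (JJ theoretical) (NN physicist))\n",
                  "spanString": "a German-born theoretical physicist"},
        },
        {
            "0": {"match": "(NP (PRP He))\n", "spanString": "He"},
        },
    ]
}


def iter_matches(matches):
    """Yield (sentence index, match id, bracketed tree string) for every match."""
    for sentence_index, sentence in enumerate(matches["sentences"]):
        for match_id, match in sentence.items():
            yield sentence_index, match_id, match["match"].strip()


rows = list(iter_matches(sample_matches))
for row in rows:
    print(row)
```

Each sentence gets its own dict, and the match ids restart within each sentence, which is why two levels of looping are needed.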

Hence, I came up with the function stanza_phrases, which loops through the nested dicts returned by tregex and formats each match for Tree.fromstring in NLTK.

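def stanza_phrases(matches):
  Nps = []
  # matches['sentences'] is a list with one dict of matches per sentence
  for items in matches['sentences']:
    for keys,values in items.items():
      # wrap each matched subtree in a ROOT node so extract_phrase can iterate over its children
      s = '(ROOT\n'+ values['match']+')'
      Nps.extend(extract_phrase(s, pattern))
  return set(Nps)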

The helper extract_phrase parses that string into an NLTK tree and collects the text of every subtree with the given label:

from nltk.tree import Tree
def extract_phrase(tree_str, label):
    phrases = []
    trees = Tree.fromstring(tree_str)
    for tree in trees:
        for subtree in tree.subtrees():
            if subtree.label() == label:
                t = subtree
                t = ' '.join(t.leaves())
                phrases.append(t)

    return phrases

Here is my output:

{'Albert Einstein', 'He', 'a German-born theoretical physicist', 'relativity', 'the theory', 'the theory of relativity'}
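
As a cross-check on this output: each 'match' string is already the bracketed subtree for one NP, so its words can be read off directly. The sketch below does that with a regular expression instead of NLTK's Tree.fromstring (a swapped-in technique, not what the code above does). The names LEAF and leaves_text are hypothetical, the sample string is hand-written, and the regex assumes that leaf tokens never contain raw parentheses.

```python
import re

# A leaf is the token sitting right before a closing parenthesis, e.g. "theory" in "(NN theory)".
# Assumption: tokens contain no raw "(" or ")" characters.
LEAF = re.compile(r"\s([^()\s]+)\)")


def leaves_text(tree_str):
    """Join the leaf tokens of one bracketed tree string with spaces."""
    return " ".join(LEAF.findall(tree_str))


# Hand-written example of a 'match' value for a nested noun phrase.
match_str = "(NP\n  (NP (DT the) (NN theory))\n  (PP (IN of)\n    (NP (NN relativity))))\n"
print(leaves_text(match_str))  # the theory of relativity
```

The inner phrases 'the theory' and 'relativity' still show up in the set above because the pattern 'NP' matches every NP node, including nested ones, and each arrives as its own match.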

Is there a way to make this code more concise, with fewer lines (especially the stanza_phrases and extract_phrase functions)?


Solution

    from stanza.server import CoreNLPClient
    
    # get noun phrases with tregex
    def noun_phrases(_client, _text, _annotators=None):
        pattern = 'NP'
        matches = _client.tregex(_text,pattern,annotators=_annotators)
        print("\n".join(["\t"+sentence[match_id]['spanString'] for sentence in matches['sentences'] for match_id in sentence]))
    
    # English example
    with CoreNLPClient(timeout=30000, memory='16G') as client:
        englishText = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
        print('---')
        print(englishText)
        noun_phrases(client,englishText,_annotators="tokenize,ssplit,pos,lemma,parse")
    
    # French example
    with CoreNLPClient(properties='french', timeout=30000, memory='16G') as client:
        frenchText = "Je suis John."
        print('---')
        print(frenchText)
        noun_phrases(client,frenchText,_annotators="tokenize,ssplit,mwt,pos,lemma,parse")
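
If a set of phrases is wanted instead of printed lines, as in the question's output, the same 'spanString' lookup can feed a set comprehension. This is a minimal sketch rather than part of the answer above: noun_phrase_set is a hypothetical helper, and sample_matches is a hand-written stand-in for the tregex response so that the block runs without a server.

```python
def noun_phrase_set(matches):
    """Collect the 'spanString' of every match, across all sentences, into a set."""
    return {
        match["spanString"]
        for sentence in matches["sentences"]
        for match in sentence.values()
    }


# Hand-written stand-in for client.tregex(text, 'NP'); only 'spanString' is mocked (assumption).
sample_matches = {
    "sentences": [
        {
            "0": {"spanString": "Albert Einstein"},
            "1": {"spanString": "a German-born theoretical physicist"},
        },
        {
            "0": {"spanString": "He"},
            "1": {"spanString": "the theory of relativity"},
            "2": {"spanString": "the theory"},
            "3": {"spanString": "relativity"},
        },
    ]
}

phrases = noun_phrase_set(sample_matches)
print(sorted(phrases))
```

With a running client, the call would presumably look like noun_phrase_set(client.tregex(text, 'NP')), which removes the need for both NLTK and the ROOT-wrapping step.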