Tags: python, tree, nltk, chunking

NLTK linguistic tree traversal: extracting noun phrases (NP)


I created a custom classifier-based chunker, DigDug_classifier, which chunks the following sentence:

sentence = "There is high signal intensity evident within the disc at T1."

It produces these chunks:

(S
  (NP There/EX)
  (VP is/VBZ)
  (NP high/JJ signal/JJ intensity/NN evident/NN)
  (PP within/IN)
  (NP the/DT disc/NN)
  (PP at/IN)
  (NP T1/NNP)
  ./.)

I need a list of just the NP chunks from the above, as plain strings, like this:

NP = ['There', 'high signal intensity evident', 'the disc', 'T1']

I wrote the following code:

output = []
for subtree in DigDug_classifier.parse(pos_tags): 
    try:
        if subtree.label() == 'NP': output.append(subtree)
    except AttributeError:
        output.append(subtree)
print(output)

But that gives me this output instead:

[Tree('NP', [('There', 'EX')]), Tree('NP', [('high', 'JJ'), ('signal', 'JJ'), ('intensity', 'NN'), ('evident', 'NN')]), Tree('NP', [('the', 'DT'), ('disc', 'NN')]), Tree('NP', [('T1', 'NNP')]), ('.', '.')]

What can I do to get the desired answer?


Solution

  • First, see How to Traverse an NLTK Tree object?

    Specific to your question of extracting NPs:

    >>> from nltk import Tree
    >>> parse_tree = Tree.fromstring("""(S
    ...   (NP There/EX)
    ...   (VP is/VBZ)
    ...   (NP high/JJ signal/JJ intensity/NN evident/NN)
    ...   (PP within/IN)
    ...   (NP the/DT disc/NN)
    ...   (PP at/IN)
    ...   (NP T1/NNP)
    ...   ./.)""")
    
    # Iterate through the top-level children of the parse tree and
    # 1. check that the child is a Tree (not a bare leaf such as './.') and
    # 2. make sure its label is NP
    >>> [subtree for subtree in parse_tree if isinstance(subtree, Tree) and subtree.label() == "NP"]
    [Tree('NP', ['There/EX']), Tree('NP', ['high/JJ', 'signal/JJ', 'intensity/NN', 'evident/NN']), Tree('NP', ['the/DT', 'disc/NN']), Tree('NP', ['T1/NNP'])]
    
    # To access the items inside each Tree object,
    # use the .leaves() method
    >>> [subtree.leaves() for subtree in parse_tree if isinstance(subtree, Tree) and subtree.label() == "NP"]
    [['There/EX'], ['high/JJ', 'signal/JJ', 'intensity/NN', 'evident/NN'], ['the/DT', 'disc/NN'], ['T1/NNP']]
    
    # To get one string per chunk,
    # join the leaves with " ".join()
    >>> [' '.join(subtree.leaves()) for subtree in parse_tree if isinstance(subtree, Tree) and subtree.label() == "NP"]
    ['There/EX', 'high/JJ signal/JJ intensity/NN evident/NN', 'the/DT disc/NN', 'T1/NNP']
    
    
    # To get just the words without the POS tags,
    # split each leaf on "/" and
    # keep the first part
    >>> [" ".join([leaf.split('/')[0] for leaf in subtree.leaves()]) for subtree in parse_tree if isinstance(subtree, Tree) and subtree.label() == "NP"]
    ['There', 'high signal intensity evident', 'the disc', 'T1']
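
One caveat: the tree built with Tree.fromstring above has string leaves such as 'There/EX', whereas the output printed in the question shows that the chunker returns leaves as (word, tag) tuples. In that case there is nothing to split on "/"; the word is simply the first element of each tuple. Below is a minimal sketch under that assumption. The helper name np_strings and the hand-built tree chunked are made up for illustration; chunked only stands in for whatever DigDug_classifier.parse(pos_tags) returns.

```python
from nltk import Tree


def np_strings(chunk_tree):
    """Return each top-level NP chunk as a space-joined string of its words."""
    phrases = []
    for child in chunk_tree:
        # Unchunked tokens (e.g. the final period) are plain (word, tag)
        # tuples rather than Tree objects, so they are skipped here.
        if isinstance(child, Tree) and child.label() == "NP":
            # Leaves are assumed to be (word, tag) tuples; keep only the word.
            phrases.append(" ".join(word for word, tag in child.leaves()))
    return phrases


# Hand-built stand-in for DigDug_classifier.parse(pos_tags) (assumption).
chunked = Tree("S", [
    Tree("NP", [("There", "EX")]),
    Tree("VP", [("is", "VBZ")]),
    Tree("NP", [("high", "JJ"), ("signal", "JJ"),
                ("intensity", "NN"), ("evident", "NN")]),
    Tree("PP", [("within", "IN")]),
    Tree("NP", [("the", "DT"), ("disc", "NN")]),
    Tree("PP", [("at", "IN")]),
    Tree("NP", [("T1", "NNP")]),
    (".", "."),
])

print(np_strings(chunked))
# ['There', 'high signal intensity evident', 'the disc', 'T1']
```

With the real chunker, the call would presumably be np_strings(DigDug_classifier.parse(pos_tags)). If the chunks can be nested, chunk_tree.subtrees(filter=lambda t: t.label() == "NP") walks every level of the tree instead of only the top-level children.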