
NLTK RegEx Chunker not capturing defined grammar patterns with wildcards


I am trying to chunk a sentence with NLTK, using regular expressions over the POS tags. Two rules are defined to identify phrases, based on the tags of the words in the sentence.

Mainly, I want to capture a chunk of one or more verbs followed by an optional determiner and then one or more nouns. This is the first rule in the grammar, but it is not being captured as a phrase chunk.

import nltk

## Defining the POS tagger 
tagger = nltk.data.load(nltk.tag._POS_TAGGER)


## A Single sentence - input text value
textv="This has allowed the device to start, and I then see glitches which is not nice."
tagged_text = tagger.tag(textv.split())

## Defining Grammar rules for  Phrases
actphgrammar = r"""
     Ph: {<VB*>+<DT>?<NN*>+}  # verbal phrase - one or more verbs followed by optional determiner, and one or more nouns at the end
     {<RB*><VB*|JJ*|NN*\$>} # Adverbial phrase - Adverb followed by adjective / Noun or Verb
     """

### Parsing the defined grammar for  phrases
actp = nltk.RegexpParser(actphgrammar)

actphrases = actp.parse(tagged_text)

The input to the chunker, tagged_text is as below.

tagged_text Out[7]: [('This', 'DT'), ('has', 'VBZ'), ('allowed', 'VBN'), ('the', 'DT'), ('device', 'NN'), ('to', 'TO'), ('start,', 'NNP'), ('and', 'CC'), ('I', 'PRP'), ('then', 'RB'), ('see', 'VB'), ('glitches', 'NNS'), ('which', 'WDT'), ('is', 'VBZ'), ('not', 'RB'), ('nice.', 'NNP')]

In the final output, only the adverbial phrase ('then see'), which matches the second rule, is captured. I expected the verbal phrase ('allowed the device') to match the first rule and be captured as well, but it is not.

actphrases Out[8]: Tree('S', [('This', 'DT'), ('has', 'VBZ'), ('allowed', 'VBN'), ('the', 'DT'), ('device', 'NN'), ('to', 'TO'), ('start,', 'NNP'), ('and', 'CC'), ('I', 'PRP'), Tree('Ph', [('then', 'RB'), ('see', 'VB')]), ('glitches', 'NNS'), ('which', 'WDT'), ('is', 'VBZ'), ('not', 'RB'), ('nice.', 'NNP')])

The NLTK version used is 2.0.5 (Python 2.7). Any help or suggestion would be greatly appreciated.


Solution

  • You're close; a small change to your regex gives the desired output. In a RegexpParser grammar, * is a quantifier on the preceding character, not a wildcard, so <VB*> matches VB but not VBN or VBZ. To match any tag that starts with VB, use .* instead of *, e.g. VB.* instead of VB*:

    >>> from nltk import word_tokenize, pos_tag, RegexpParser
    >>> text = "This has allowed the device to start, and I then see glitches which is not nice."
    >>> tagged_text = pos_tag(word_tokenize(text))    
    >>> g = r"""
    ... VP: {<VB.*><DT><NN.*>}
    ... """
    >>> p = RegexpParser(g); p.parse(tagged_text)
    Tree('S', [('This', 'DT'), ('has', 'VBZ'), Tree('VP', [('allowed', 'VBN'), ('the', 'DT'), ('device', 'NN')]), ('to', 'TO'), ('start', 'VB'), (',', ','), ('and', 'CC'), ('I', 'PRP'), ('then', 'RB'), ('see', 'VBP'), ('glitches', 'NNS'), ('which', 'WDT'), ('is', 'VBZ'), ('not', 'RB'), ('nice', 'JJ'), ('.', '.')])
    

    Note that your grammar still catches the adverbial phrase Tree('Ph', [('then', 'RB'), ('see', 'VB')]) because those tags are exactly RB and VB. The * in your second rule ({<RB*><VB*|JJ*|NN*\$>}) never has to act as a wildcard in this case, so the problem does not show up there.

    Also, if these are two different types of phrases, it is better to give them two labels rather than one. And (I think) the end-of-string \$ after the wildcard is redundant, so a cleaner grammar would be:
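
    To see why the original rules behaved this way, the two patterns can be tried against individual tags with Python's own re module. This is only a rough sketch: NLTK rewrites tag patterns before compiling them, so plain re.fullmatch is an approximation of what the chunker does, and tag_matches is a hypothetical helper written for this illustration.

```python
import re


def tag_matches(pattern, tag):
    """Return True if the whole tag matches the pattern (plain regex semantics)."""
    return re.fullmatch(pattern, tag) is not None


tags = ["VB", "VBN", "VBZ", "NN", "NNS"]

# "VB*" means "V followed by zero or more B", so only the bare VB tag matches.
print([t for t in tags if tag_matches(r"VB*", t)])

# "VB.*" means "VB followed by any characters", so every verb tag matches.
print([t for t in tags if tag_matches(r"VB.*", t)])
```

    This is why 'see' (VB) was matched in the question while 'allowed' (VBN) was not.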

    g = r"""
    VP: {<VB.*><DT><NN.*>}
    AdvP: {<RB.*><VB.*|JJ.*|NN.*>}
    """