I have found this code here:
# Import required libraries
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag, word_tokenize, RegexpParser
# Example text
sample_text = "The quick brown fox jumps over the lazy dog"
# Find all parts of speech in the above sentence
tagged = pos_tag(word_tokenize(sample_text))
# Define a chunk grammar to extract phrases from the tagged text
chunker = RegexpParser("""
NP: {<DT>?<JJ>*<NN>} #To extract Noun Phrases
P: {<IN>} #To extract Prepositions
V: {<V.*>} #To extract Verbs
PP: {<P> <NP>} #To extract Prepositional Phrases
VP: {<V> <NP|PP>*} #To extract Verb Phrases
""")
# Parse the tagged sentence and print the resulting chunk tree
output = chunker.parse(tagged)
print("After Extracting\n", output)
As I understand it, this code defines PP, NP and VP... What puzzles me is that the syntactic tags are defined right here in the grammar. Aren't these composite tags already defined in NLTK, or is defining them yourself the point? Furthermore, in the last rule of the chunker, {<V> <NP|PP>*}, is it using the previously defined NP: {<DT>?<JJ>*<NN>} and PP: {<P> <NP>}?
In the example you found, the idea is to use the conventional names for the syntactic constituents of sentences to create a chunker - a parser that breaks sentences down into rather coarse-grained pieces at the desired level. This simple (simplistic?) approach is used in favour of a full syntactic parse, which would require breaking the utterances down to word level and labelling each word with its function in the sentence.
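To see how coarse-grained the result is, here is a minimal sketch reusing the output tree from your snippet (names as in the question): chunking only builds a shallow tree, so any token not covered by a rule is left as a bare (word, tag) pair directly under the root.

# Minimal sketch, reusing the output tree from the snippet above.
for child in output:
    if hasattr(child, 'label'):        # a chunk built by one of the rules
        print(child.label(), child.leaves())
    else:                              # a token no rule covered
        print('untouched:', child)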
The grammar passed as the parameter of RegexpParser is to be chosen arbitrarily, depending on your needs (and on the structure of the utterances it is to be applied to). These rules can refer to one another - they correspond to the production rules of a BNF formal grammar. Your observation is therefore valid - the last rule, for VP, refers to the chunks produced by the previously defined NP and PP rules.
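For illustration, here is a short sketch (again reusing output from the snippet above) that looks inside any VP chunk; the nested subtrees it prints should be exactly the chunks built by the earlier V, NP and PP rules, which is the cascading you noticed.

from nltk.tree import Tree

# Sketch only: the VP rule {<V> <NP|PP>*} matches the V chunk plus any NP or
# PP chunks that the earlier rules have already built, so those appear as
# nested subtrees inside the VP chunk.
for vp in output.subtrees(filter=lambda t: t.label() == 'VP'):
    print(vp)
    for child in vp:
        if isinstance(child, Tree):
            print('VP contains a', child.label(), 'chunk')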