I was trying to implement a regex on a list of grammar tags in python, for finding the tense form of the list of grammar. And I wrote the following code to implement it.
Data preprocessing:
from nltk import word_tokenize, pos_tag
import nltk
text = "He will have been doing his homework."
tokenized = word_tokenize(text)
tagged = pos_tag(tokenized)
tags = []
for i in range(len(tagged)):
t = tagged[i]
tags.append(t[1])
print(tags)
regex formula i.e. to be implemented
grammar = r"""
Future_Perfect_Continuous: {<MD><VB><VBN><VBG>}
Future_Continuous: {<MD><VB><VBG>}
Future_Perfect: {<MD><VB><VBN>}
Past_Perfect_Continuous: {<VBD><VBN><VBG>}
Present_Perfect_Continuous:{<VBP|VBZ><VBN><VBG>}
Future_Indefinite: {<MD><VB>}
Past_Continuous: {<VBD><VBG>}
Past_Perfect: {<VBD><VBN>}
Present_Continuous: {<VBZ|VBP><VBG>}
Present_Perfect: {<VBZ|VBP><VBN>}
Past_Indefinite: {<VBD>}
Present_Indefinite: {<VBZ>|<VBP>}
Function to implement the regex on the list tags
def check_grammar(grammar, tags):
cp = nltk.RegexpParser(grammar)
result = cp.parse(tags)
print(result)
result.draw()
check_grammar(grammar, tags)
But it returned an error as:
Traceback (most recent call last):
File "/home/samar/Desktop/twitter_tense/main.py", line 35, in <module>
check_grammar(grammar, tags)
File "/home/samar/Desktop/twitter_tense/main.py", line 31, in check_grammar
result = cp.parse(tags)
File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 1276, in parse
chunk_struct = parser.parse(chunk_struct, trace=trace)
File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 1083, in parse
chunkstr = ChunkString(chunk_struct)
File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 95, in __init__
tags = [self._tag(tok) for tok in self._pieces]
File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 95, in <listcomp>
tags = [self._tag(tok) for tok in self._pieces]
File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 105, in _tag
raise ValueError("chunk structures must contain tagged " "tokens or trees")
ValueError: chunk structures must contain tagged tokens or trees
Your call to the cp.parse()
function expects each of the tokens in your sentence to be tagged, however, the tags
list you created only contains the tags but not the tokens as well, hence your ValueError
. The solution is to instead pass the output from the pos_tag()
call (i.e. tagged
) to your check_grammar
call (see below).
from nltk import word_tokenize, pos_tag
import nltk
text = "He will have been doing his homework."
tokenized = word_tokenize(text)
tagged = pos_tag(tokenized)
print(tagged)
# Output
>>> [('He', 'PRP'), ('will', 'MD'), ('have', 'VB'), ('been', 'VBN'), ('doing', 'VBG'), ('his', 'PRP$'), ('homework', 'NN'), ('.', '.')]
my_grammar = r"""
Future_Perfect_Continuous: {<MD><VB><VBN><VBG>}
Future_Continuous: {<MD><VB><VBG>}
Future_Perfect: {<MD><VB><VBN>}
Past_Perfect_Continuous: {<VBD><VBN><VBG>}
Present_Perfect_Continuous:{<VBP|VBZ><VBN><VBG>}
Future_Indefinite: {<MD><VB>}
Past_Continuous: {<VBD><VBG>}
Past_Perfect: {<VBD><VBN>}
Present_Continuous: {<VBZ|VBP><VBG>}
Present_Perfect: {<VBZ|VBP><VBN>}
Past_Indefinite: {<VBD>}
Present_Indefinite: {<VBZ>|<VBP>}"""
def check_grammar(grammar, tags):
cp = nltk.RegexpParser(grammar)
result = cp.parse(tags)
print(result)
result.draw()
check_grammar(my_grammar, tagged)
>>> (S
>>> He/PRP
>>> (Future_Perfect_Continuous will/MD have/VB been/VBN doing/VBG)
>>> his/PRP$
>>> homework/NN
>>> ./.)