I am trying to analyze some documents with a parser generated by Grako. The grammar should parse simple sentences for further analysis, but I am having difficulties with some special tokens.
The (Grako-style) EBNF looks like:
abbr::str = "etc." | "feat.";
word::str = /[^.]+/;
sentence::Sentence = content:{abbr | word} ".";
page::Page = content:{sentence};
I used the grammar above on the following content:
This is a sentence. This is a sentence feat. an abbrevation. I don't know feat. etc. feat. know English.
The result using a simple NodeWalker:
[
'This is a sentence.',
'This is a sentence feat.',
'an abbrevation.',
"I don't know feat.",
'etc. feat. know English.'
]
My expectation:
[
'This is a sentence.',
'This is a sentence feat. an abbrevation.',
"I don't know feat. etc. feat. know English."
]
I have no clue why this happens, especially in the last sentence, where the abbreviations are part of the sentence while they are not in the prior sentences. To be clear, I want the abbr rule in the sentence definition to have a higher priority than the word rule, but I don't know how to achieve this. I played around with negative and positive lookahead without success.
I know how to achieve my expected results with regular expressions, but a context-free grammar is required for the further analysis, so I want to put everything in one grammar for the sake of readability. It has been a while since I last used grammars this way, but I don't remember running into this kind of problem. I searched for a while via Google with no success, so maybe the community can share some insight.
Thanks in advance.
Code I used for testing, if required:
    from grako.model import NodeWalker, ModelBuilderSemantics
    from parser import MyParser


    class MyWalker(NodeWalker):
        def walk_Page(self, node):
            # Walk every sentence node and print the resulting strings.
            content = [self.walk(c) for c in node.content]
            print(content)

        def walk_Sentence(self, node):
            # Rejoin the tokens and restore the terminating period.
            return ' '.join(node.content) + "."

        def walk_str(self, node):
            return node


    def main(filename: str):
        parser = MyParser(semantics=ModelBuilderSemantics())
        with open(filename, 'r', encoding='utf-8') as src:
            result = parser.parse(src.read(), 'page')
        walker = MyWalker()
        walker.walk(result)
Packages used: Python 3.5.2, Grako 3.16.5
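The problem is the regular expression you're using for the word rule. A regular expression matches whatever you tell it to, and /[^.]+/ matches every character except a period, whitespace included. A single word match therefore swallows everything up to the next period, for example "This is a sentence feat", so abbr is never tried at "feat." and the following "." ends the sentence. In your last sentence the abbreviations are only recognized because they come right at the start of the sentence, before word has consumed anything.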
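To see the effect in isolation, here is a small sketch that uses Python's re module rather than Grako itself. It only compares what the two word patterns match at the start of part of the sample text; the variable names are illustrative.

```python
import re

text = "This is a sentence feat. an abbrevation."

# Original word pattern: anything except a period, so spaces are included.
greedy = re.match(r"[^.]+", text).group()

# Corrected word pattern: stops at periods and at whitespace.
single = re.match(r"[^.\s]+", text).group()

print(repr(greedy))  # 'This is a sentence feat'
print(repr(single))  # 'This'
```

With the original pattern, one word match already covers "feat", so the only thing left at that position is the period that closes the sentence.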
This modified grammar does what you want:
@@grammar:: Pages
abbr::str = "etc." | "feat.";
word::str = /[^.\s]+/;
sentence::Sentence = content:{abbr | word} ".";
page::Page = content:{sentence};
start = page ;
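As a rough cross-check of what the modified grammar should produce, the following hand-written sketch imitates its rules in plain Python. It is not Grako's parsing algorithm and does not use the generated parser; split_sentences is a hypothetical helper, and it assumes whitespace between tokens is skipped and that the tokens are rejoined the way walk_Sentence does.

```python
import re

ABBR = ("etc.", "feat.")        # mirrors the abbr rule
WORD = re.compile(r"[^.\s]+")   # mirrors the corrected word rule


def split_sentences(text):
    """Imitate page = {sentence}, trying abbr before word at each position."""
    sentences, tokens, pos = [], [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1                      # skip whitespace between tokens
            continue
        abbr = next((a for a in ABBR if text.startswith(a, pos)), None)
        if abbr:                          # abbr has priority over word
            tokens.append(abbr)
            pos += len(abbr)
            continue
        match = WORD.match(text, pos)
        if match:
            tokens.append(match.group())
            pos = match.end()
            continue
        # Only a period can remain here: it terminates the sentence.
        sentences.append(" ".join(tokens) + ".")
        tokens = []
        pos += 1
    return sentences


sample = ("This is a sentence. This is a sentence feat. an abbrevation. "
          "I don't know feat. etc. feat. know English.")
for sentence in split_sentences(sample):
    print(sentence)
```

Because word can no longer run across spaces, each token boundary gives abbr a chance to match first, which is the priority the question asks for.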
A --trace run revealed the problem right away.