I've been trying to work out the basic skeleton for a language I've been designing, and I'm attempting to use Parsimonious to do the parsing for me. As of right now, I've, declared the following grammar:
grammar = Grammar(
"""
program = expr*
expr = _ "{" lvalue (rvalue / expr)* "}" _
lvalue = _ ~"[a-z0-9\\-]+" _
rvalue = _ ~".+" _
_ = ~"[\\n\\s]*"
"""
)
When I try to output the resulting AST of a simple input string like "{ do-something some-argument }"
:
print(grammar.parse("{ do-something some-argument }"))
Parsimonious decides to flat-out reject it, and then gives me this somewhat cryptic error:
Traceback (most recent call last): File "tests.py", line 13, in <module> print(grammar.parse("{ do-something some-argument }")) File "/usr/local/lib/python2.7/dist-packages/parsimonious/grammar.py", line 112, in parse return self.default_rule.parse(text, pos=pos) File "/usr/local/lib/python2.7/dist-packages/parsimonious/expressions.py", line 109, in parse raise IncompleteParseError(text, node.end, self) parsimonious.exceptions.IncompleteParseError: Rule 'program' matched in its entirety, but it didn't consume all the text. The non-matching portion of the text begins with '{ do-something some-' (line 1, column 1).
At first I thought this might be an issue related to my whitespace rule, _
, but after a few failed attempts at removing the whitespace rule in certain places, I was still coming up with the same error.
I've tried searching online, but all I've found that seems to be remotely related, is this question, which didn't help me in any way.
Am I doing something wrong with my grammar? Am I not parsing the input in the correct way? If anyone has a possible solution to this, it'd be greatly appreciated.
I am very far from an expert on Parsimonious, but I believe the problem is that ~".+"
is greedily matching the whole remainder of the input string, leaving nothing to match the rest of the production. I initially tested that idea by changing the regex for rvalue
to ~"[a-z0-9\\-]+"
, same as the one you have for lvalue
. Now it parses, and (awesomely) distinguishes by context between the two identically defined tokens lvalue
and rvalue
.
from parsimonious.grammar import Grammar
grammar = Grammar(
"""
program = expr*
expr = _ "{" lvalue (rvalue / expr)* "}" _
lvalue = _ ~"[a-z0-9\\-]+" _
rvalue = _ ~"[a-z0-9\\-]+" _
_ = ~"[\\n\\s]*"
"""
)
print(grammar.parse( "{ do-something some-argument }"))
If you mean for rvalue
to match any sequence of non-whitespace characters, you want something more like this:
rvalue = _ ~"[^\\s\\n]+" _
But whoops!
{ foo bar }
"}"
is a closing curly brace, but it's also a sequence of one or more non-whitespace characters. Is it "}"
or rvalue
? The grammar says the next token can be either of those. One of those interpretations is parsable and the other isn't, but Parsimonious just says it's spinach and the hell with it. I don't know if a parsing maven would consider that a legitimate way to resolve the ambiguity (e.g. maybe such a grammar may result in cases with two possible interpretations that both parse), or how practical that would be to implement. In any case Parsimonious doesn't make that call.
So we need to repel boarders on the curly brace issue. I think this grammar does what you want:
from parsimonious.grammar import Grammar
grammar = Grammar(
"""
program = expr*
expr = _ "{" lvalue (expr / rvalue)* "}" _
lvalue = _ ~"[a-z0-9\\-]+" _
rvalue = _ ~"[^{}\\n\\s]+" _
_ = ~"[\\n\\s]*"
"""
)
print(grammar.match( "{ do-something some-argument 23423 {foo bar} &^%$ }"))
I excluded open curly brace as well, because how would you expect this string to tokenize?
{foo bar{baz poo}}
I would expect
"{" "foo" "bar" "{" "baz" "poo" "}" "}"
...because if "poo}"
is expected to tokenize as "poo"
"}"
, and "{foo"
is expected to tokenize as "{"
"foo"
, then treating bar{baz
as "bar{baz"
or "bar{"
"baz"
is derangedcounterintuitive.
Now I remember how my bitter hatred of yacc drove me to an obsession with it.