I'm using Menhir to create a parser, and there's a behaviour that always trips me, and I don't understand it. I have created the following minimal example to demonstrate it; this shows the declaration of the receiver argument in a method declaration in the Go language (http://golang.org/ref/spec#Method_declarations):
%{
%}
%token <string> T_identifier
%token T_star
%start <unit> demo
%%
(* This rule has a shift/reduce conflict
demo:
| option(T_identifier) option(T_star) T_identifier { () }
*)
(* This rule is okay. *)
demo:
| T_identifier T_star T_identifier { () }
| T_identifier T_identifier { () }
| T_star T_identifier { () }
| T_identifier { () }
As far as I can see, both rules are semantically equivalent: we are looking for an optional identifier (the name of the receiver), an optional star (pointer or not) and a mandatory type name (the type of the receiver). However, the first rule (the one commented out) gives a shift/reduce conflict while the second rule works fine.
I've been able to progress in my parser by replacing option
with multiple rules whenever this has happened, but it's been nagging me that I don't understand why it's happening.
(If you don't know menhir, it is an LR(1) parser generator, so knowledge of how other similar tools work would probably apply.)
I suppose that Menhir reduces EBNF to BNF through some standard transformations. That's pretty common. Unfortunately these transformations can undermine LR(1) parsability.
Consider your rule, in another EBNF-like syntax:
demo → IDENTIFIER? STAR? IDENTIFIER
One way to translate that into BNF would be as you do in your second set of rules: define four different rules, one for each possibility. That transformation will never alter LR(1) parsability, and it is always possible for rules with an "optional" operator, but it has two disadvantages:
If there are several optional elements in a rule, the end result is a lot of productions.
It doesn't work for repetition operators.
Another way which seems more general is to create a new non-terminal for each extended BNF operator. So we could do this:
optional_identifier → IDENTIFIER | ε
optional_star → STAR | ε
demo → optional_identifier optional_star IDENTIFIER
A similar transformation would work for x*
:
repeated_x → ε | repeated_x x
That certainly produces an equivalent language, but now the grammar might not be LR(1).
In particular, demo
is no longer LR(1). It fails right at the beginning. Say the first input token is IDENTIFIER
. That could be the start of
IDENTIFIER IDENTIFIER
or
IDENTIFIER
(or some other possibilities, but that's enough to show the problem.)
In the second case (just a type), we need to reduce an optional_identifier
and an optional_star
before we can shift the IDENTIFIER
. In the first case (a variable and a type), we need to shift the IDENTIFIER
immediately. And the only information we have available to tell the difference is the lookahead token, IDENTIFIER
, which clearly isn't enough.
If we use the four-way expanded production, then there is no problem:
demo → IDENTIFIER
| STAR IDENTIFIER
| IDENTIFIER IDENTIFIER
| IDENTIFIER STAR IDENTIFIER
Here, when we see an IDENTIFIER
, we don't know whether it's part of the first production, the third production or the fourth production. But it doesn't matter because in all cases, we just shift the IDENTIFIER
and wait for more information.
A similar phenomenon occurs with yacc/bison
and other parser generators which allow mid-rule actions (MRAs). An MRA is turned into a new non-terminal whose only production is an ε production; the purpose of the new non-terminal is to run the MRA when it is reduced. That's really cool, except that sometimes the new non-terminal is introduced at a point where we can't know whether it is appropriate to reduce it or not. So the MRA can turn a perfectly good LR(1) grammar into a non-LR(1) grammar, even though the language has not changed.
Although not relevant in the case of Menhir, it's possibly interesting that the EBNF transformation above, if done carefully, does not introduce ambiguity which was not otherwise present. So even if the resulting grammar is no longer LR(1), it is still unambiguous and could be parsed with a GLR parser. However, since Menhir does not, as far as I know, generate GLR parsers, that fact might not be very useful.