While writing parser code in Menhir, I keep coming across this design pattern that is becoming very frustrating. I'm trying to build a parser that accepts either "a*ba" or "bb". To do this, I'm using the following syntax (note that A*
is the same as list(A)
):
exp:
| A*; B; A; {1}
| B; B; {2}
However, this code fails to parse the string "ba". The menhir compiler also indicates that there are shift-reduce conflicts in the parser, specifically as follows:
** In state 0, looking ahead at B, shifting is permitted
** because of the following sub-derivation:
. B B
** In state 0, looking ahead at B, reducing production
** list(A) ->
** is permitted because of the following sub-derivation:
list(A) B A // lookahead token appears
So | B A
requires a shift, while | A* B A
requires a reduce when the first token is B
. I can resolve this ambiguity manually and get the expected behavior by changing the expression to read as follows (note that A+
is the same as nonempty_list(A)
):
exp2:
| B; A; {1}
| A+; B; A; {1}
| B; B; {2}
In my mind, exp
and exp2
read the same, but are clearly treated differently. Is there a way to write exp
to do what I want without code duplication (which can cause other problems)? Is this a design pattern I should be avoiding entirely?
exp
and exp2
parse the same language, but they're definitely not the same grammar. exp
requires a two-symbol lookahead to correctly parse a sentence starting with B
, for precisely the reason you've noted: the parser can't decide whether to insert an empty A*
into the parse before it sees the symbol after the B, but it needs to do that insertion before it can handle the B
. By contrast, exp2
does not need an empty production to create an empty list of A
s before B A
, so no decision is necessary.
You don't need a list to produce this conflict. Replacing A*
with A?
would produce exactly the same conflict.
You've already found the usual solution to this shift-reduce conflict for LALR(1) parser generators: a little bit of redundancy. As you note, though, that solution is not ideal.
Another common solution (but maybe not for menhir) involves using a right-recursive definition of a terminated list:
prefix:
| B;
| A; prefix;
exp:
| prefix; A; { 1 }
| B; B; { 2 }
As far as I know, menhir's standard library doesn't include a terminated list macro, but it would easy enough to write. It might look something like this:
%public terminated_list(X, Y):
| y = Y;
{ ( [], y ) }
| x = X; xsy = terminated_list(X, Y);
{ ( x :: (fst xsy), (snd xsy) ) }
There's probably a more idiomatic way to write that; I don't pretend to be an OCAML coder.