parsingmenhir

Ambiguity when Parsing Preceding List


While writing parser code in Menhir, I keep coming across this design pattern that is becoming very frustrating. I'm trying to build a parser that accepts either "a*ba" or "bb". To do this, I'm using the following syntax (note that A* is the same as list(A)):

exp:
 | A*; B; A; {1}
 | B; B; {2}

However, this code fails to parse the string "ba". The menhir compiler also indicates that there are shift-reduce conflicts in the parser, specifically as follows:

** In state 0, looking ahead at B, shifting is permitted
** because of the following sub-derivation:

. B B 

** In state 0, looking ahead at B, reducing production
** list(A) -> 
** is permitted because of the following sub-derivation:

list(A) B A // lookahead token appears

So | B A requires a shift, while | A* B A requires a reduce when the first token is B. I can resolve this ambiguity manually and get the expected behavior by changing the expression to read as follows (note that A+ is the same as nonempty_list(A)):

exp2:
 | B; A; {1}
 | A+; B; A; {1}
 | B; B; {2}

In my mind, exp and exp2 read the same, but are clearly treated differently. Is there a way to write exp to do what I want without code duplication (which can cause other problems)? Is this a design pattern I should be avoiding entirely?


Solution

  • exp and exp2 parse the same language, but they're definitely not the same grammar. exp requires a two-symbol lookahead to correctly parse a sentence starting with B, for precisely the reason you've noted: the parser can't decide whether to insert an empty A* into the parse before it sees the symbol after the B, but it needs to do that insertion before it can handle the B. By contrast, exp2 does not need an empty production to create an empty list of As before B A, so no decision is necessary.

    You don't need a list to produce this conflict. Replacing A* with A? would produce exactly the same conflict.

    You've already found the usual solution to this shift-reduce conflict for LALR(1) parser generators: a little bit of redundancy. As you note, though, that solution is not ideal.

    Another common solution (but maybe not for menhir) involves using a right-recursive definition of a terminated list:

    prefix:
        | B;
        | A; prefix; 
    
    exp:
        | prefix; A;  { 1 }
        | B; B;       { 2 }
    

    As far as I know, menhir's standard library doesn't include a terminated list macro, but it would easy enough to write. It might look something like this:

    %public terminated_list(X, Y):
    | y = Y;
        { ( [], y ) }
    | x = X; xsy = terminated_list(X, Y);
        { ( x :: (fst xsy), (snd xsy) ) }
    

    There's probably a more idiomatic way to write that; I don't pretend to be an OCAML coder.