grammar bison ambiguous-grammar reduce-reduce-conflict

Bison reduce/reduce conflict between lambda expression and parenthesized identifier

I'm attempting to write my own programming language at the moment, and I have the following simplified grammar:

...

%%

prog: stmtlist | %empty;

block: "{" stmtlist "}";

stmtlist: stmtlist ";" stmt | stmt;
decllist: decllist "," decl | decl;
exprlist: exprlist "," expr | expr;

stmt: stmt_decl;

stmt_decl
    : "let" decllist "=" exprlist
    | "let" decllist
    ;

decl: IDENTIFIER ":" IDENTIFIER | IDENTIFIER;

expr: expr_function;

expr_function
    : "(" decllist ")" "->" expr_function
    | "(" decllist ")" "->" block
    | "("          ")" "->" expr_function
    | "("          ")" "->" block
    | expr_additive
    ;

expr_additive
    : expr_additive "+" expr_primary
    | expr_additive "-" expr_primary
    | expr_primary
    ;

expr_primary: INTVAL | FLTVAL | IDENTIFIER | "(" expr ")";

%%

...

However, when I attempt to generate the C++ parser, I get a single reduce/reduce conflict.

$ bison -v -Werror -l --defines=parser.hh -o parser.cc parser.yy
parser.yy: error: 1 reduce/reduce conflict [-Werror=conflicts-rr]
parser.yy: note: rerun with option '-Wcounterexamples' to generate conflict counterexamples

Here is the relevant section of the produced parser.output file:

State 27 conflicts: 1 reduce/reduce

...

State 27

   13 decl: "identifier" • ":" "identifier"
   14     | "identifier" •
   26 expr_primary: "identifier" •

    ":"  shift, and go to state 11

    "+"       reduce using rule 26 (expr_primary)
    "-"       reduce using rule 26 (expr_primary)
    ")"       reduce using rule 14 (decl)
    ")"       [reduce using rule 26 (expr_primary)]
    $default  reduce using rule 14 (decl)

And here is a screenshot of the counterexample, which shows that the issue is an ambiguity between lambda expressions and parenthesized identifiers.

In my language, I'd like to be able to support the following syntax. I think I understand the issue, but I'm having difficulty resolving the reduce/reduce conflict.

let x = 1

let foo = (x)           // `foo` is the same type as `x`
let bar = (y) -> y + 1  // `bar` is a function

Could someone please tell me what I need to do to get it working, or point me to some resources that will help me figure it out?

Solution

It's not really an ambiguity. As far as I can see, your language is unambiguous. However, the grammar is not deterministic when restricted to a single lookahead token.

The extremely useful counterexample generator does indeed describe the problem. When the parser sees the ) in let foo = (x) it has to decide whether x is a decl or an expr_primary. The answer will be obvious as soon as the second-next token is seen; if the ) is followed by ->, then the parentheses contain a decl_list; otherwise, the parentheses contain an expr. So there's no ambiguity. But that doesn't necessarily help you :-)

LALR(2) grammars -- which is what you have -- are a perennial problem, and there are three basic strategies to solve them:

Use a GLR parser. This is certainly the easiest strategy; all it requires is to add %glr-parser to your parser declarations. (You might also need to update your bison version and possibly specify a different parser skeleton, depending on your bison version, if your semantic type is not compatible with C.) GLR parsers can parse any unambiguous grammar, regardless of how much lookahead might be required. The extra flexibility comes at a cost, but in the case of using GLR to parse an LALR(2) grammar, the cost is almost negligible.

Note that even if you ask for a GLR parser, Bison will still report the conflict, on the assumption that you wanted to know about it. But it no longer matters, because the GLR parser can handle multiple possible parses at the same time, as long as the grammar is (eventually) unambiguous. Bison pretty well has to report the conflict, because a conflict could be the result of an ambiguity, in which case you'd need to do something. You can suppress the report using an %expect-rr declaration.

There is no algorithm which can tell whether an arbitrary grammar is ambiguous. Bison does its best with the counterexamples report, but it doesn't always work and it certainly doesn't always indicate an ambiguity. But if the grammar happens to not be ambiguous, a GLR parser will work.

With GLR parsers, ambiguity is reported as a run-time error. That's maybe not ideal, but since there is no way to tell in advance, that's the best you can do. Other GLR parser generators will return both (or all) possible parses, which you can do with Bison using a custom ambiguity resolver, but in practical applications you generally want the grammar to be unambiguous. If Bison reports a conflict, you should test the parser with relevant inputs and ensure that it doesn't fail with an ambiguity message.
Change the grammar so that it's LALR(1). This is always possible because every LR(k) language has an LALR(1) grammar. There's even a (fairly) simple algorithm which can be used to convert an LALR(k) grammar into an LALR(1) grammar provided that k has a known value. Unfortunately, the algorithm produces enormous grammars, which become extremely hard to maintain. (I guess that would be OK if bison came with an automatic rewriter, but it doesn't.) So you'd probably be better off trying to rejig the grammar by hand, which isn't too awful because there's only one conflict which requires two lookahead tokens and you already know what it is.

The rough outline of a solution goes like this:

The problem is that the parser can't tell what ( IDENTIFIER ) is until it sees the next token. So all we need to do is make it unnecessary for the parser to perform a unit reduction on that particular IDENTIFIER. To do that, we can create a redundant non-terminal and add productions which use it:
```
parenthesized_id: "(" IDENTIFIER ")"
expr_primary: parenthesized_id
expr_function: parenthesized_id "->" block
             | parenthesized_id "->" expr_function
```
That's the easy part, and in a certain sense it's all that is needed (at least, in this case). If you add those rules to your grammar, bison will report that you now have two shift-reduce conflicts and a reduce-reduce conflict, but that's a bit misleading since the reduce-reduce conflict is in the same state as one of the shift-reduce conflicts, and involves the same potential reductions. What we want in the conflict states is for the parser to not perform any possible reduction, preferring to shift the ) in order to later reduce a parenthesized_id. And that's what bison (or yacc, for that matter) will do by default; in the absence of any precedence declarations, it resolves conflicts in favour of the shift, on the basis that shifting is often the right answer. But it still reports the conflicts; if you want to suppress the warnings, you can add a slightly hacky precedence declaration which does the same thing (prefer shifting ")" to reducing a production ending with IDENTIFIER, if both actions are possible):
```
%precedence IDENTIFIER
%precedence ")"
```
If you really wanted to, you could work out how to rewrite the rules for parentheses so that neither expr_function nor expr_term can produce "(" IDENTIFIER ")", thereby proving (at least for this LALR(2) grammar) that there is an LALR(1) grammar which covers the same language. It's not pretty but it's certainly feasible. You can see an example, very similar to your grammar, in this answer from 2013.

With this solution, you can produce a parser which you know is not ambiguous. But you should still test, to make sure that the conflict resolution (or the complicated additional rules) actually parse the language you're interested in parsing.
Finally, there is the old school version (actually used by various bison/yacc grammar parsers to deal with the conflict caused by the optionality of the semicolon in a production rule): get rid of the two-token lookahead by combining the two tokens in the lexer. In other words, add to your lexer something like:
```
[)][[:space:]]*->   return CLOSE_ARROW;
```
and then change the various productions for the lambda syntax by substituting ')' with CLOSE_ARROW. As written, that pattern doesn't allow the user to put comments between the close parenthesis and the arrow, but that shouldn't be too much of a problem. The bison lexer uses a more complicated strategy, which in this case would be to always buffer ) and continue the scan; if it turns out that the next token to be returned (i.e. neither whitespace nor comment) is ->, then the combined token can be returned; otherwise, the two tokens —) and whatever followed it — need to be returned separately.