parsingrustgrammarlr-grammarambiguous-grammar

Preferring shift over reduce in parser for language without statement terminators


I'm parsing a language that doesn't have statement terminators like ;. Expressions are defined as the longest sequence of tokens, so 5-5 has to be parsed as a subtraction, not as two statements (literal 5 followed by a unary negated -5).

I'm using LALRPOP as the parser generator (despite the name, it is LR(1) instead of LALR, afaik). LALRPOP doesn't have precedence attributes and doesn't prefer shift over reduce by default like yacc would do. I think I understand how regular operator precedence is encoded in an LR grammar by building a "chain" of rules, but I don't know how to apply that to this issue.

The expected parses would be (individual statements in brackets):

"5 - 5"   → 5-5 instead of 5, -5
"5 (- 5)" → 5, -5
"- 5"     → -5
"5 5"     → 5, 5

How do I change the grammar such that it always prefers the longer parse?

Going through the first few pages of google results as well as stack overflow didn't yield any results for this specific problem. Most related questions need more lookahead or the result is to not allow consecutive statements without terminators.

I created a minimal sample grammar that reproduces the shift/reduce conflict (a statement in this grammar is just an expression, in the full grammar there would also be "if", "while", etc. and more levels of operator precedence, but I've omitted them for brevity). Besides unary minus, there are also other conflicts in the original grammar like print(5), which could be parsed as the identifier print and a parenthesized number (5) or a function call. There might be more conflicts like this, but all of them have the same underlying issue, that the longer sequence should be preferred, but both are currently valid, though only the first should be.

For convenience, I created a repo (checkout and cargo run). The grammar is:

use std::str::FromStr;

grammar;

match {
    "+",
    "-",
    "(",
    ")",
    r"[0-9]+",
    
    // Skip whitespace
    r"\s*" => { },
}

Expr: i32 = {
    <l:Expr> "+" <r:Unary> => l + r,
    <l:Expr> "-" <r:Unary> => l - r,
    Unary,
};

Unary: i32 = {
    "-" <r:Unary> => -r,
    Term,
}

Term: i32 = {
    Num,
    "(" <Expr> ")",
};

Num: i32 = {
    r"[0-9]+" => i32::from_str(<>).unwrap(),
};

Stmt: i32 = {
    Expr
};

pub Stmts: Vec<i32> = {
    Stmt*
};

Part of the error (full error message):

/lalrpop-shift-repro/src/test.lalrpop:37:5: 37:8: Local ambiguity detected

  The problem arises after having observed the following symbols in the input:
    Stmt+ Expr
  At that point, if the next token is a `"-"`, then the parser can proceed in two different ways.

  First, the parser could execute the production at
  /lalrpop-shift-repro/src/test.lalrpop:37:5: 37:8, which would consume
  the top 1 token(s) from the stack and produce a `Stmt`. This might then yield a parse tree like
    Expr    ╷ Stmt
    ├─Stmt──┤    │
    ├─Stmt+─┘    │
    └─Stmt+──────┘

  Alternatively, the parser could shift the `"-"` token and later use it to construct a `Expr`. This might
  then yield a parse tree like
    Stmt+ Expr "-" Unary
    │     ├─Expr───────┤
    │     └─Stmt───────┤
    └─Stmt+────────────┘

  See the LALRPOP manual for advice on making your grammar LR(1).

Solution

  • The issue you're going to have to confront is how to deal with function calls. I can't really give you any concrete advice based on your question, because the grammar you provide lacks any indication of the intended syntax of functions calls, but the hint that print(5) is a valid statement makes it clear that there are two distinct situations, which need to be handled separately.

    Consider:

    5 - 5     One statement           5 ( - 5 )  Two statements
    print(-5) One statement           print - 5  Two statements (presumably)
    a - 5     ???
    

    The ambiguity of the third expression could be resolved if the compiler knew whether a is a function or a variable (if we assume that functions are not first-class values, making print an invalid statement). But there aren't many ways that the parser could know that, and none of them seem very likely:

    Since I don't know the precise requirements here, I'll show a grammar (in yacc/bison syntax, although it should be easy enough to translate it) which attempts to illustrate a representative sample. It implements one statement (return) in addition to expression statements, and expressions include multiplication, subtraction, negation and single argument function calls. To force "greedy" expressions, it prohibits certain statement sequences:

    Those rules are easy to state, but the actual implementation is annoyingly repetitive because it requires various different kinds of expressions, depending on what the first and last token in the expression is, and possibly different kinds of statements, if you have statements which might end with an expression. (return x, for example.) The formalism used by ECMAScript would be useful here, but I suspect that your parser-generator doesn't implement it -- although it's possible that its macro facility could be used to that effect, if it came with something resembling documentation. Without that, there is a lot of duplication.

    In a vague attempt to generate the grammar, I used the following suffixes:

    The absence of a suffix is used for the union of different possibilities. There are probably more unit productions than necessary. It has not been thoroughly debugged, but it worked on a few test cases (see below):

    program      : block
    
    block_id     : stmt_id
                 | block_id stmt_oth_id
                 | block_nid stmt_pr_id
                 | block_nid stmt_oth_id
    block_nid    : stmt_nid
                 | block_id stmt_oth_nid
                 | block_nid stmt_pr_nid
                 | block_nid stmt_oth_nid
    block        : %empty
                 | block_id | block_nid
    
    stmt_un_id   : expr_un_id
    stmt_un_nid  : expr_un_nid
    stmt_pr_id   : expr_pr_id
    stmt_pr_nid  : expr_pr_nid
    stmt_oth_id  : expr_oth_id
                 | return_id
    stmt_oth_nid : expr_oth_nid
                 | return_nid
    stmt_id      : stmt_un_id  | stmt_pr_id  | stmt_oth_id
    stmt_nid     : stmt_un_nid | stmt_pr_nid | stmt_oth_nid
    
    return_id    : "return" expr_id
    return_nid   : "return" expr_nid
    
    expr_un_id   : sum_un_id
    expr_un_nid  : sum_un_nid
    expr_pr_id   : sum_pr_id
    expr_pr_nid  : sum_pr_nid
    expr_oth_id  : sum_oth_id
    expr_oth_nid : sum_oth_nid
    expr_id      : expr_un_id  | expr_pr_id  | expr_oth_id
    expr_nid     : expr_un_nid | expr_pr_nid | expr_oth_nid
    expr         : expr_id | expr_nid
    
    sum_un_id    : mul_un_id
                 | sum_un '-' mul_id
    sum_un_nid   : mul_un_nid
                 | sum_un '-' mul_nid
    sum_un       : sum_un_id | sum_un_nid
    sum_pr_id    : mul_pr_id
                 | sum_pr '-' mul_id
    sum_pr_nid   : mul_pr_nid
                 | sum_pr '-' mul_nid
    sum_pr       : sum_pr_id | sum_pr_nid
    sum_oth_id   : mul_oth_id
                 | sum_oth '-' mul_id
    sum_oth_nid  : mul_oth_nid
                 | sum_oth '-' mul_nid
    sum_oth      : sum_oth_id | sum_oth_nid
    
    mul_un_id    : unary_un_id
                 | mul_un '*' unary_id
    mul_un_nid   : unary_un_nid
                 | mul_un '*' unary_nid
    mul_un       : mul_un_id | mul_un_nid
    mul_pr_id    : mul_pr '*' unary_id
    mul_pr_nid   : unary_pr_nid
                 | mul_pr '*' unary_nid
    mul_pr       : mul_pr_id | mul_pr_nid
    mul_oth_id   : unary_oth_id
                 | mul_oth '*' unary_id
    mul_oth_nid  : unary_oth_nid
                 | mul_oth '*' unary_nid
    mul_oth      : mul_oth_id | mul_oth_nid
    mul_id       : mul_un_id  | mul_pr_id  | mul_oth_id
    mul_nid      : mul_un_nid | mul_pr_nid | mul_oth_nid
    
    unary_un_id  : '-' unary_id
    unary_un_nid : '-' unary_nid
    unary_pr_nid : term_pr_nid
    unary_oth_id : term_oth_id
    unary_oth_nid: term_oth_nid
    unary_id     : unary_un_id  | unary_oth_id
    unary_nid    : unary_un_nid | unary_pr_nid | unary_oth_nid
    
    term_oth_id  : IDENT
    term_oth_nid : NUMBER
                 | IDENT '(' expr ')'
    term_pr_nid  : '(' expr ')'
    

    Here's a little test:

    > 5-5
    { [- 5 5] }
    > 5(-5)
    { 5; [~ -- 5] }
    > a-5
    { [- a 5] }
    > a(5)
    { [CALL a 5] }
    > -7*a
    { [* [~ -- 7] a] }
    > a*-7
    { [* a [~ -- 7]] }
    > a-b*c
    { [- a [* b c]] }
    > a*b-c
    { [- [* a b] c] }
    > a*b(3)-c
    { [- [* a [CALL b 3]] c] }
    > a*b-c(3)
    { [- [* a b] [CALL c 3]] }
    > a*b-7(3)
    { [- [* a b] 7]; 3 }