pythonebnflark-parser

Disambiguating expressions in Lark syntax


I have a (Lark) grammar that I think should be unambiguous, but depending on the version of Lark, fails to parse in one way or another:

import lark

syntax = r"""
stmt: mov_stmt
    | special_stmt

mov_stmt: reg ASSIGN (reg | const)
special_stmt: ("RS" SPECIAL_ASSIGN const)

reg: REG
const: DEC_NUM

REG.2: /R[0-7]|RS/
DEC_NUM: /0|[1-9]\d*/i

ASSIGN: "="
SPECIAL_ASSIGN: "&="

WS: /[ \t]+/
%ignore WS
"""

parser = lark.Lark(syntax, start="stmt", parser="lalr")

print(parser.parse("R3 = 7"))   # 1. ok
print(parser.parse("R3 = R7"))  # 2. ok
print(parser.parse("RS &= 1"))  # 3. Fails on lark==1.1.9; expected special_stmt
print(parser.parse("RS = R7"))  # 4. Fails on lark-parser==0.12.0; expected mov_stmt

With lark-parser==0.12.0, invocation number 4. fails. I expect a mov_stmt, but it is expecting a SPECIAL_ASSIGN token, meaning it is matching special_stmt.

lark.exceptions.UnexpectedToken: Unexpected token Token('ASSIGN', '=') at line 1, column 4.
Expected one of:
        * SPECIAL_ASSIGN
Previous tokens: [Token('RS', 'RS')]

With lark==1.1.9, the opposite happens and invocation number 3. fails. I expect a special_stmt, but it is expecting an ASSIGN token, meaning it is matching mov_stmt.

lark.exceptions.UnexpectedToken: Unexpected token Token('SPECIAL_ASSIGN', '&=') at line 1, column 4.
Expected one of:
        * ASSIGN
Previous tokens: [Token('REG', 'RS')]

In my mind, the grammar should be unambiguous. An = always means mov_stmt, and &= always means special_stmt (which only works for reg=RS).

How do I disambiguate this?

I tried assigning priorities to different terminals, to no effect.


Solution

  • This fixes the grammar:

    stmt: mov_stmt
        | special_stmt
    
    mov_stmt: reg ASSIGN (reg | const)
    special_stmt: (special_reg SPECIAL_ASSIGN const)
    //             ^^^^^^^^^^^
    
    reg: REG | SPECIAL_REG      // <<<
    special_reg: SPECIAL_REG    // <<<
    const: DEC_NUM
    
    REG.2: /R[0-7]/             // <<<
    SPECIAL_REG.2: "RS"         // <<<
    
    DEC_NUM: /0|[1-9]\d*/i
    
    ASSIGN: "="
    SPECIAL_ASSIGN: "&="
    
    WS: /[ \t]+/
    %ignore WS
    

    It seems that when "RS" was part of the REG terminal regex, and special_stmt used a "RS" literal, there was ambiguity.

    By defining a new SPECIAL_REG terminal, (and including it in the reg rule), the lexer is able to properly disambiguate the statements.