I have a (Lark) grammar that I think should be unambiguous, but depending on the version of Lark, fails to parse in one way or another:
import lark
syntax = r"""
stmt: mov_stmt
| special_stmt
mov_stmt: reg ASSIGN (reg | const)
special_stmt: ("RS" SPECIAL_ASSIGN const)
reg: REG
const: DEC_NUM
REG.2: /R[0-7]|RS/
DEC_NUM: /0|[1-9]\d*/i
ASSIGN: "="
SPECIAL_ASSIGN: "&="
WS: /[ \t]+/
%ignore WS
"""
parser = lark.Lark(syntax, start="stmt", parser="lalr")
print(parser.parse("R3 = 7")) # 1. ok
print(parser.parse("R3 = R7")) # 2. ok
print(parser.parse("RS &= 1")) # 3. Fails on lark==1.1.9; expected special_stmt
print(parser.parse("RS = R7")) # 4. Fails on lark-parser==0.12.0; expected mov_stmt
With lark-parser==0.12.0
, invocation number 4. fails. I expect a mov_stmt
, but it is expecting a SPECIAL_ASSIGN
token, meaning it is matching special_stmt
.
lark.exceptions.UnexpectedToken: Unexpected token Token('ASSIGN', '=') at line 1, column 4.
Expected one of:
* SPECIAL_ASSIGN
Previous tokens: [Token('RS', 'RS')]
With lark==1.1.9
, the opposite happens and invocation number 3. fails. I expect a special_stmt
, but it is expecting an ASSIGN
token, meaning it is matching mov_stmt
.
lark.exceptions.UnexpectedToken: Unexpected token Token('SPECIAL_ASSIGN', '&=') at line 1, column 4.
Expected one of:
* ASSIGN
Previous tokens: [Token('REG', 'RS')]
In my mind, the grammar should be unambiguous. An =
always means mov_stmt
, and &=
always means special_stmt
(which only works for reg=RS
).
How do I disambiguate this?
I tried assigning priorities to different terminals, to no effect.
This fixes the grammar:
stmt: mov_stmt
| special_stmt
mov_stmt: reg ASSIGN (reg | const)
special_stmt: (special_reg SPECIAL_ASSIGN const)
// ^^^^^^^^^^^
reg: REG | SPECIAL_REG // <<<
special_reg: SPECIAL_REG // <<<
const: DEC_NUM
REG.2: /R[0-7]/ // <<<
SPECIAL_REG.2: "RS" // <<<
DEC_NUM: /0|[1-9]\d*/i
ASSIGN: "="
SPECIAL_ASSIGN: "&="
WS: /[ \t]+/
%ignore WS
It seems that when "RS" was part of the REG
terminal regex, and special_stmt
used a "RS" literal, there was ambiguity.
By defining a new SPECIAL_REG
terminal, (and including it in the reg
rule), the lexer is able to properly disambiguate the statements.