I'm having some trouble parsing an input. Below is the offending snippet:
?start: forstmt
forstmt: "for" ID "in" range FOO -> forstmt
?range: "[" atom DOTDOT atom "]" -> rangetuple
DOTDOT.2: ".."
?atom: NUM -> num
%import common.NUMBER -> NUM
%import common.WS
%ignore WS
When I input something like for i in [1..10] print(i)
, I expect a tokenization similar to:
[
1
..
10
]
But even with assigning the center dots their own named terminal, and assigning that a higher priority than the numbers, it still seems to always parse as:
[
1.
.10
]
I've tried changing the terminal priority, I've tried changing the rule priority under the earley
parser, and I've tried running the earley
parser with ambiguity='explicit'
, all with no change to the current erroneous output.
Here's the error I get when running with parser(grammar, parser='lalr')
:
Unexpected token Token('NUM', '.10') at line 1, column 13.
Expected one of:
* DOTDOT
* MINUS
* STAR
* SLASH
* PLUS
Previous tokens: [Token('NUM', '1.')]
...and here is the full grammar I'm using in case I omitted something important.
?start: stmtlist
?stmtlist: stmt (";" stmt?)* -> stmtlist
stmt: "var" ID "=" expr -> decl
| ID "=" expr -> assign
| "if" "(" expr ")" stmt ("else" stmt)? -> ifstmt
| "while" "(" expr ")" stmt -> whstmt
| "for" ID "in" range stmt -> forstmt
| "print" "(" expr ")" -> prstmt
| "{" stmtlist "}" -> block
?range: "[" expr DOTDOT expr "]" -> rangetuple
DOTDOT.2: ".."
?expr: expr "+" term -> add
| expr "-" term -> sub
| term
?term: term "*" atom -> mul
| term "/" atom -> div
| atom
?atom: "(" expr ")"
| ID -> var
| NUM -> num
%import common.CNAME -> ID
%import common.NUMBER -> NUM
%import common.WS
%ignore WS
Also, I am aware that using integers instead of all numeric input types for NUM
resolves the issue, I was just hoping to not be so restrictive with the grammar if at all possible. I also know that having real number values in a range parameter is nonsense, but I'm hoping to accept expressions like i+3
in the range value, which can reuse my functions which can accept floats. I was hoping to do the type checking for the range domain in the rangetuple
function; it makes more sense to me to be able to reuse my functions and just do my type guarding/error checking in the functions using the expressions.
Minimum reproducible example:
Lark: 1.1.9
Python: 3.12.3
import sys
from lark import Lark, v_args
from lark.visitors import Interpreter
from typing import Any
grammar = """
?start: stmtlist
?stmtlist: stmt? (";" stmt?)* -> stmtlist
stmt: "for" ID "in" range stmt -> forstmt
| "print" "(" expr ")" -> prstmt
| "{" stmtlist "}" -> block
?range: "[" expr ".." expr "]" -> rangetuple
?expr: ID -> var
| NUM -> num
%import common.CNAME -> ID
%import common.NUMBER -> NUM
%import common.WS
%ignore WS
"""
parser = Lark(grammar)
class Env():
'''Keeps track of variables and manages scopes for those variables.'''
def __init__(self) -> None:
# initializes with one scope open by default; global scope
self._scopes: list[dict] = [dict()]
def extend(self, x: str, val: Any | None = None) -> None:
'''Declares `x` as a variable in the
current scope, with `x = val`.'''
location = self._findScope(x)
# variable is undefined or has not been declared in current scope
if (location is None) or (location != (len(self._scopes) - 1)):
self._scopes[-1][x] = val
# variable is defined in current scope
else:
raise NameError(f'{x} was redeclared', name=x)
def update(self, x: str, val: Any) -> None:
'''Assigns `val` to current visible `x`.'''
location = self._findScope(x)
if location is not None:
self._scopes[location][x] = val # overwrite nearest x
else:
raise NameError(f'{x} is undefined', name=x)
def _findScope(self, x) -> int | None:
'''Return the scope index that `x` exists in, if any.
Returns index in order by environment recency.'''
for index in range(len(self._scopes)-1, -1, -1):
if x in self._scopes[index]:
return index
return None
def lookup(self, x: str) -> Any:
'''returns `x`'s value from the most recently assigned scope.'''
location = self._findScope(x)
if location is not None:
return self._scopes[location][x]
else:
raise NameError(f"{x} is undefined", name=x)
def openScope(self) -> None:
'''Increases variable stack height.'''
self._scopes.append(dict())
def closeScope(self) -> None:
'''Decreases variable stack height.'''
if len(self._scopes) >= 1:
del self._scopes[-1]
else:
raise IndexError("attempted to close scope when none were open")
@v_args(inline=True)
class Eval(Interpreter):
def __init__(self, env: Env | None = None):
super().__init__()
# if no environment is supplied, it will rely on its own
self.env = env if env is not None else Env()
def num(self, val) -> int | float:
for char in ".e": # signifiers for float
if char in val:
return float(val)
return int(val)
def var(self, var: str) -> int | float:
return self.env.lookup(var)
def block(self, statements) -> None:
self.env.openScope()
self.visit(statements)
self.env.closeScope()
def stmtlist(self, *statements) -> None:
for statement in statements:
self.visit(statement)
def forstmt(self, id, rangetuple, stmt) -> None:
self.env.openScope()
self.env.extend(id)
for i in range(*self.visit(rangetuple)):
self.env.update(id, i)
self.visit(stmt)
self.env.closeScope()
def rangetuple(self, expr1, expr2) -> tuple[int, int]:
return (self.visit(expr1), self.visit(expr2))
def main():
try:
prog = sys.stdin.read()
tree = parser.parse(prog)
Eval().visit(tree)
except Exception as e:
print(e)
if __name__ == "__main__":
main()
and I'm piping this file to stdin
as input:
for i in [1..3] print(i)
results in:
No terminal matches '.' in the current parser context, at line 1 col 13
for i in [1..3] print(i)
^
Expected one of:
* __ANON_0
And changing the parsers used yields similar results.
Most scanner(lexer) implementations including one in Lark are context-free. It means the scanner matches "1..3" as a sequence of a number "1." and an invalid character ".", instead of matching it as a sequence of "1", "..", and "3" by looking back(backtracking) when the scanner finds an invalid character.
So, you can't do what you want. I recommend you define your specification saying the user should avoid such an ambiguous code, such as by writing "1 .. 3" instead of "1..3".