pythonlark

Lark parser issues with tokenizing priority


I'm having some trouble parsing an input. Below is the offending snippet:

?start: forstmt

forstmt: "for" ID "in" range FOO  -> forstmt

?range: "[" atom DOTDOT atom "]" -> rangetuple

DOTDOT.2: ".."

?atom: NUM            -> num

%import common.NUMBER -> NUM
%import common.WS
%ignore WS

When I input something like for i in [1..10] print(i), I expect a tokenization similar to:

[ 1 .. 10 ]

But even with assigning the center dots their own named terminal, and assigning that a higher priority than the numbers, it still seems to always parse as:

[ 1. .10 ]

I've tried changing the terminal priority, I've tried changing the rule priority under the earley parser, and I've tried running the earley parser with ambiguity='explicit', all with no change to the current erroneous output.

Here's the error I get when running with parser(grammar, parser='lalr'):

Unexpected token Token('NUM', '.10') at line 1, column 13.
Expected one of: 
        * DOTDOT
        * MINUS
        * STAR
        * SLASH
        * PLUS
Previous tokens: [Token('NUM', '1.')]

...and here is the full grammar I'm using in case I omitted something important.

?start: stmtlist

?stmtlist: stmt (";" stmt?)* -> stmtlist

stmt: "var" ID "=" expr         -> decl
    | ID "=" expr               -> assign
    | "if" "(" expr ")" stmt ("else" stmt)? -> ifstmt
    | "while" "(" expr ")" stmt -> whstmt
    | "for" ID "in" range stmt  -> forstmt
    | "print" "(" expr ")"      -> prstmt
    | "{" stmtlist "}"          -> block

?range: "[" expr DOTDOT expr "]" -> rangetuple

DOTDOT.2: ".."

?expr: expr "+" term  -> add
    |  expr "-" term  -> sub
    |  term

?term: term "*" atom  -> mul
    |  term "/" atom  -> div
    |  atom

?atom: "(" expr ")"
    |  ID             -> var
    |  NUM            -> num

  %import common.CNAME  -> ID
  %import common.NUMBER -> NUM
  %import common.WS
  %ignore WS

Also, I am aware that using integers instead of all numeric input types for NUM resolves the issue, I was just hoping to not be so restrictive with the grammar if at all possible. I also know that having real number values in a range parameter is nonsense, but I'm hoping to accept expressions like i+3 in the range value, which can reuse my functions which can accept floats. I was hoping to do the type checking for the range domain in the rangetuple function; it makes more sense to me to be able to reuse my functions and just do my type guarding/error checking in the functions using the expressions.

Minimum reproducible example:

Lark: 1.1.9

Python: 3.12.3

import sys
from lark import Lark, v_args
from lark.visitors import Interpreter
from typing import Any

grammar = """
?start: stmtlist

?stmtlist: stmt? (";" stmt?)* -> stmtlist

stmt: "for" ID "in" range stmt  -> forstmt
    | "print" "(" expr ")"      -> prstmt
    | "{" stmtlist "}"          -> block

?range: "[" expr ".." expr "]" -> rangetuple

?expr: ID             -> var
    |  NUM            -> num

%import common.CNAME  -> ID
%import common.NUMBER -> NUM
%import common.WS
%ignore WS
"""

parser = Lark(grammar)


class Env():
    '''Keeps track of variables and manages scopes for those variables.'''

    def __init__(self) -> None:
        # initializes with one scope open by default; global scope
        self._scopes: list[dict] = [dict()]

    def extend(self, x: str, val: Any | None = None) -> None:
        '''Declares `x` as a variable in the
        current scope, with `x = val`.'''
        location = self._findScope(x)
        # variable is undefined or has not been declared in current scope
        if (location is None) or (location != (len(self._scopes) - 1)):
            self._scopes[-1][x] = val
        # variable is defined in current scope
        else:
            raise NameError(f'{x} was redeclared', name=x)

    def update(self, x: str, val: Any) -> None:
        '''Assigns `val` to current visible `x`.'''
        location = self._findScope(x)

        if location is not None:
            self._scopes[location][x] = val  # overwrite nearest x
        else:
            raise NameError(f'{x} is undefined', name=x)

    def _findScope(self, x) -> int | None:
        '''Return the scope index that `x` exists in, if any.
        Returns index in order by environment recency.'''
        for index in range(len(self._scopes)-1, -1, -1):
            if x in self._scopes[index]:
                return index

        return None

    def lookup(self, x: str) -> Any:
        '''returns `x`'s value from the most recently assigned scope.'''
        location = self._findScope(x)

        if location is not None:
            return self._scopes[location][x]
        else:
            raise NameError(f"{x} is undefined", name=x)

    def openScope(self) -> None:
        '''Increases variable stack height.'''
        self._scopes.append(dict())

    def closeScope(self) -> None:
        '''Decreases variable stack height.'''
        if len(self._scopes) >= 1:
            del self._scopes[-1]
        else:
            raise IndexError("attempted to close scope when none were open")


@v_args(inline=True)
class Eval(Interpreter):
    def __init__(self, env: Env | None = None):
        super().__init__()
        # if no environment is supplied, it will rely on its own
        self.env = env if env is not None else Env()

    def num(self, val) -> int | float:
        for char in ".e":  # signifiers for float
            if char in val:
                return float(val)

        return int(val)

    def var(self, var: str) -> int | float:
        return self.env.lookup(var)

    def block(self, statements) -> None:
        self.env.openScope()
        self.visit(statements)
        self.env.closeScope()

    def stmtlist(self, *statements) -> None:
        for statement in statements:
            self.visit(statement)

    def forstmt(self, id, rangetuple, stmt) -> None:
        self.env.openScope()
        self.env.extend(id)
        for i in range(*self.visit(rangetuple)):
            self.env.update(id, i)
            self.visit(stmt)
        self.env.closeScope()

    def rangetuple(self, expr1, expr2) -> tuple[int, int]:
        return (self.visit(expr1), self.visit(expr2))


def main():
    try:
        prog = sys.stdin.read()
        tree = parser.parse(prog)
        Eval().visit(tree)
    except Exception as e:
        print(e)


if __name__ == "__main__":
    main()

and I'm piping this file to stdin as input:

for i in [1..3] print(i)

results in:

No terminal matches '.' in the current parser context, at line 1 col 13

for i in [1..3] print(i)
            ^
Expected one of: 
        * __ANON_0

And changing the parsers used yields similar results.


Solution

  • Most scanner(lexer) implementations including one in Lark are context-free. It means the scanner matches "1..3" as a sequence of a number "1." and an invalid character ".", instead of matching it as a sequence of "1", "..", and "3" by looking back(backtracking) when the scanner finds an invalid character.

    So, you can't do what you want. I recommend you define your specification saying the user should avoid such an ambiguous code, such as by writing "1 .. 3" instead of "1..3".