python python-3.x parsing lark-parser

Parser or postlex causing an error in Lark


Hi everyone. I'm parsing shell output (mocked here) and running into an error where I really don't expect one. A minimal, reproducible, working example is below:

from rich import print as rprint
import typing as tp

from lark import Lark, Transformer, Tree
from lark.indenter import Indenter

class TreeIndenter(Indenter):
    NL_type = '_NL'
    OPEN_PAREN_types: tp.List = []
    CLOSE_PAREN_types: tp.List = []
    INDENT_type = '_INDENT'
    DEDENT_type = '_DEDENT'
    tab_len = 8

    @property
    def always_accept(self):
        return (self.NL_type,)

kwargs = {
    "parser": "lalr",
    "postlex": TreeIndenter(),
    "maybe_placeholders": False,
}


text = """
=======================================================================
SKIT                       SEASON          EPISODE         CAST NUMBER
=======================================================================
skit_name=vikings          3               10              3
skit_name=parrot           2               5               2
skit_name=eel              1               7               2
"""

grammar = r"""
start: [_NL] header data
header: line_break column_names line_break 
data: data_line+
data_line: me_info (STRING2 | STRING)* _NL
me_info: "skit_name="STRING

line_break: "="* _NL
column_names: (STRING | STRING2)* _NL


STRING2         : STRING " " STRING
STRING          : ESCAPED_STRING | VALUE
VALUE           : ("_" | LETTER | DIGIT | "-" | "[]" | "/" | "." | ":")+

%import common.ESCAPED_STRING
%import common.LETTER
%import common.DIGIT
%ignore / /

_NL: /(\r?\n[\t ]*)+/
"""

parser = Lark(grammar=grammar)
rprint(parser.parse(text))

This outputs the correct tree. Do note that kwargs isn't being used.

However, since I need to combine this with a parser for indented output, I need to use an Indenter and the kwargs listed above. When I include them, I get the following error (full trace omitted):

UnexpectedToken: Unexpected token Token('STRING', 'skit_name') at line 5, column 1.
Expected one of: 
    * __ANON_0

This means the first data line causes the problem, but it's not obvious what is actually expected.

Interestingly, if the first line break is omitted (from both the text and the grammar), the text parses successfully.

Additionally, the error seems to occur when either parser or postlex is included, and it's the same error no matter which of them is in kwargs.

EDIT: I was hoping to come up with a workaround for the indentation and avoid the parser and postlex keywords, but it seems that specifying the lalr parser is required to use the Transformer, so I need it anyway and can't side-step the problem.


Solution

  • Providing a PostLexer changes the default parser/lexer combo from earley/dynamic to earley/basic, since the dynamic lexer can't handle the postlexer. However, the basic lexer is far less powerful and can't handle this kind of ambiguity. It sees that a STRING would fit at that point and simply uses that.
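
To illustrate that last point, here is a rough sketch of a lexer that commits to the first terminal that matches, without considering what comes next. It is not Lark's actual implementation: the terminal order, the simplified VALUE pattern (the "[]" alternative is left out), and the assumption that __ANON_0 is the anonymous terminal for "skit_name=" are all made up for illustration.

```python
import re

# Simplified stand-ins for two terminals from the grammar in the question.
# Assumption of this sketch: the broad STRING pattern is tried first.
TERMINALS = [
    ("STRING", r"[A-Za-z0-9_\-/.:]+"),  # approximates VALUE, without "[]"
    ("__ANON_0", r"skit_name="),        # presumably the literal "skit_name="
]


def first_token(text, terminals=TERMINALS):
    """Return (name, value) for the first terminal matching at the start of text."""
    for name, pattern in terminals:
        match = re.match(pattern, text)
        if match:
            # Commit to this terminal; no look-ahead, no backtracking.
            return name, match.group()
    raise ValueError(f"no terminal matches {text!r}")


line = "skit_name=vikings"

# The broad pattern wins and stops right before "=".
print(first_token(line))  # ('STRING', 'skit_name')

# If the literal is tried first, the expected token comes out instead.
print(first_token(line, list(reversed(TERMINALS))))  # ('__ANON_0', 'skit_name=')
```

The first call mirrors the error message in the question: the lexer hands the parser a STRING with the value 'skit_name' where the grammar expected the literal.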

    There are a few possible solutions: