python, parsing, grammar, lark-parser

Why does this grammar work with the Earley parser but not LALR(1)?


I wrote this grammar to describe a simple patch procedure we use at work, operating on source code files. Engineers check in files of the form:

{ignored space}
//find_start {possible comment or directive}
{content to find}
//find_end
{ignored space}
//replace_start
{content to replace it with}
//replace_end
{ignored space}
...

Given the following grammar, Lark with the Earley parser produces an accurate parse tree for example text in this format.

start: block+

block: ignore? find_block ignore? replace_block ignore?

find_block:  "//find_start" [comment] content "//find_end" 
replace_block: "//replace_start" content "//replace_end"

_NL: /\n/+
LINE.-10: /.+/
COMMENT: /\s.+/
content: (LINE _NL)+
ignore: (LINE _NL)+
comment: (COMMENT _NL)

%import common.NEWLINE
%import common.WS
%ignore WS
%ignore NEWLINE
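
For reference, this is roughly how I'm driving Lark (a minimal sketch; assume grammar holds the grammar above as a string and sample_text holds the example file shown further down):

from lark import Lark

# Earley (Lark's default algorithm) builds the parser and handles the sample text:
earley_parser = Lark(grammar, parser="earley")
print(earley_parser.parse(sample_text).pretty())

# The only change for the failing case is the parser argument:
lalr_parser = Lark(grammar, parser="lalr")
lalr_parser.parse(sample_text)  # raises the exception below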

However, when attempting to use the LALR(1) parser, I get the following exception:

lark.exceptions.UnexpectedToken: Unexpected token Token('LINE', 'a') at line 3, column 1.
Expected one of: 
        * _NL
Previous tokens: [Token('LINE', 'asdf')]

I've tried several iterations of this grammar, with alternating success between LALR(1) and Earley; this version is the first to produce the output I'm looking for. After spending some time looking online, I understand that the grammar I've cobbled together isn't the best, and I'm open to any suggestions for improvements.

Example text I've been using to test the grammar:

//find_start asdf
a
//find_end

//replace_start
A
//replace_end

//find_start
b
//find_end
this should be ignored
//replace_start
B
//replace_end

//find_start
c
//find_end

//replace_start
C
C
//replace_end

For those curious why we use such an admittedly hacky patch process, it's somewhat by necessity. These patches operate on Verilog RTL code. Due to IP restrictions, the engineers writing the original behavioral code are not allowed to know about library specifics. Physical design engineers like myself take this behavioral code and convert parts of it into structural, gate-level code for optimization. This results in patches that can never be checked back into upstream. Part of why I'm writing this grammar is to eventually remove this process entirely.


Solution

  • Your grammar is ambiguous. (An ignore could be the end of one block, or the start of the next.) Earley's algorithm can handle ambiguous grammars, but LALR(1) parser construction cannot. That seems like it would be the answer to your question, but I don't think it is: I'd expect the LALR(1) parser to fail at construction time with a shift/reduce conflict, whereas the error you're getting is at parse time.

    Instead, I think the error arises because the grammar is conflicted about whitespace: the definitions of _NL and COMMENT involve \n and \s, yet the grammar also says %ignore WS and %ignore NEWLINE. If the tokenizer is following the %ignore directives, it can never recognize COMMENT or _NL, which is why the error message reports two successive LINE tokens. Mind you, that objection seems like it should apply to the Earley parser as well, so I don't know why you're not getting the same error there.

    Setting your question aside, this doesn't look like a use case that calls for a parser; even regular expressions would be overkill. I'd do something like this:

    loop
       read lines (echoing them to output, presumably)
       until line starts with "//find_start"
    
       read lines (discarding them)
       until line starts with "//find_end"
    
       read a line, ignore it
       assert next line starts with "//replace_start"
    
       read lines (echoing them to output)
       until line starts with "//replace_end"
    end
    

    (modulo handling end-of-file and errors)
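
    A literal Python translation of that loop, just as a sketch (the function name is made up, I'm assuming the input arrives as an iterable of lines, and I'm still skipping end-of-file and error handling):

    from typing import Iterable, Iterator

    def apply_patch_stream(lines: Iterable[str]) -> Iterator[str]:
        it = iter(lines)
        while True:
            # Echo lines to the output until a find block begins.
            for line in it:
                if line.startswith("//find_start"):
                    break
                yield line
            else:
                return  # input exhausted, no more blocks

            # Discard the content of the find block.
            for line in it:
                if line.startswith("//find_end"):
                    break

            # Skip the one ignored line, then expect the replace block.
            next(it, None)
            assert next(it, "").startswith("//replace_start")

            # Echo the replacement content.
            for line in it:
                if line.startswith("//replace_end"):
                    break
                yield line

    Called as, say, sys.stdout.writelines(apply_patch_stream(patch_file)) for an open file object.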