
Parse trailing line comments with lark and LALR


Given the following lark grammar (grammar.lark) and Python source code:

start: (TEXT _NEWLINE)+

TEXT: /[^\n]+/

COMMENT: /\/\/[^\n]*/ _NEWLINE
%ignore COMMENT

_NEWLINE: (" "* "\n")+

from lark import Lark

parser = Lark.open("grammar.lark", parser='lalr')

parser.parse("""Lorem ipsum
// line comment
Text with // trailing comment
""")

The above parser produces this tree: [image: actual tree]

The first line of text is parsed correctly, and the second line (which is a comment) is ignored as intended. However, the TEXT token for the last line still contains the trailing comment, which is supposed to be ignored.

This is the expected output: [image: expected tree]

I realize that my grammar makes it perfectly legal for a TEXT token to contain two consecutive slashes, even though they should introduce a line comment. However, I do not know how to prevent this. Is there any way I can disallow two consecutive slashes in TEXT, or give higher priority to the COMMENT terminal?
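To make the problem concrete, here is a minimal sketch that applies the TEXT pattern on its own with Python's `re` module. It only approximates what the lexer does for this one terminal (lark builds its lexers on regular expressions), and the variable names are mine, not part of the grammar.

```python
import re

# TEXT pattern from the grammar above: one or more non-newline characters.
TEXT = re.compile(r"[^\n]+")

line = "Text with // trailing comment\n"
match = TEXT.match(line)

# The match runs up to the newline, so "//" is consumed as ordinary text
# and never gets the chance to start a COMMENT token.
print(repr(match.group()))
```

So the pattern itself has no reason to stop at the two slashes; whatever fixes this has to make TEXT stop before `//`.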


Solution

  • I just found a grammar that seems to work:

    start: (TEXT _NEWLINE)+
    
    TEXT: /(\/?[^\n\/])+/
    
    COMMENT: /\/\/[^\n]*/
    %ignore COMMENT
    
    _NEWLINE: (" "* COMMENT? "\n")+
    

    I doubt this is the most elegant solution, so I'd appreciate another answer or suggestions.