python lark-parser lark

Optional newline at end of file messes up Lark grammar parsing


I have a file's contents to parse. If the final line has a NL at the end, I get 2 token trees as expected, but if the file doesn't have a final NL, I end up with 3 trees, which is unwanted.

How do I get Lark to parse this correctly? I am at a loss as to why a NL at the end of a line affects the grouping at the beginning of that line.

Correct Output, with the unwanted final NL:

FILE = """
__ 95 95 36 __ 95 __ 95 __
__ __ 95 36 32 __ __ __ __
"""

Output will be:

Tree(Token('RULE', 'map'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
Tree(Token('RULE', 'map'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])

Incorrect Output, no final NL:

FILE = """
__ 95 95 36 __ 95 __ 95 __
__ __ 95 36 32 __ __ __ __"""

Output will be:

Tree(Token('RULE', 'map'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
Tree(Token('RULE', 'map'), ['__'])
Tree(Token('RULE', 'map'), ['__', '95', '36', '32', '__', '__', '__', '__'])

Code:

from lark import Lark, Transformer, Token, Discard

FILE = """
__ 95 95 36 __ 95 __ 95 __
__ __ 95 36 32 __ __ __ __
"""

grammar = """
    start      : NEWLINE? map+
    map        : [coord coord*] NEWLINE?
    coord      : HEX | FILL
    HEX        : ("A".."F" | DIGIT)+
    FILL       : "__"

    %import common.DIGIT
    %import common.NEWLINE
    %import common.WS_INLINE
    %ignore WS_INLINE
"""

class MyParser(Transformer):

    def start(self, nodes: list) -> list:
        return nodes

    def NEWLINE(self, tree):
        return Discard

    def coord(self, tokens: list[Token]) -> str:
        return tokens[0].value

def parse(text: str):
    parser = Lark(grammar, start='start')
    tree = parser.parse(text)
    for node in MyParser().transform(tree):
        print(node)

parse(FILE)

Solution

  • Preliminary comments

    Using your code, I don't get two trees, nor even three, but four:

    Tree(Token('RULE', 'map'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
    Tree(Token('RULE', 'map'), ['__', '__'])
    Tree(Token('RULE', 'map'), ['95'])
    Tree(Token('RULE', 'map'), ['36', '32', '__', '__', '__', '__'])
    

    I can get your result with two trees by making the NEWLINE mandatory in map, i.e. map : [coord coord*] NEWLINE rather than map : [coord coord*] NEWLINE?

    Apart from that, still in map, I do not understand why you wrote [coord coord*] rather than [coord+] or even just coord+.
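    To illustrate why the optional NEWLINE matters, here is a small sketch in plain Python (no Lark involved; the helper name splits is mine). With NEWLINE? a map may stop after any coord, so a line of coords can be divided among consecutive map nodes in many ways, and nothing in the grammar says which division is the right one. The sketch only enumerates those divisions for the last line; it ignores empty map nodes, which [coord coord*] also permits.

```python
from itertools import combinations


def splits(tokens):
    """Yield every way to cut `tokens` into consecutive non-empty groups."""
    n = len(tokens)
    for k in range(n):  # k = number of cut points, from 0 to n - 1
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [tokens[a:b] for a, b in zip(bounds, bounds[1:])]


last_line = "__ __ 95 36 32 __ __ __ __".split()
divisions = list(splits(last_line))

print(len(divisions))  # 256, i.e. 2 ** (9 - 1)

# The unwanted grouping reported in the question is one of them.
print([last_line[:1], last_line[1:]] in divisions)  # True
```

    Both the three-tree output from the question and the four-tree output I get correspond to one of these divisions. For a line that ends with a NEWLINE, making that NEWLINE mandatory leaves only the undivided line.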

    Your problem

    Of course, the change I made does not solve the problem when the last line does not end with a NEWLINE; worse, it raises a lark.exceptions.UnexpectedEOF.
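    A workaround outside the grammar, offered only as a sketch: keep NEWLINE mandatory in map and append a newline to the text before parsing when it lacks one. The helper name ensure_trailing_newline is hypothetical, not part of Lark.

```python
def ensure_trailing_newline(text: str) -> str:
    """Return `text` unchanged if it ends with a newline, else append one."""
    return text if text.endswith("\n") else text + "\n"


with_nl = "__ 95 95 36\n__ __ 95 36\n"
without_nl = "__ 95 95 36\n__ __ 95 36"

# Both variants end up identical, so the parser sees the same input.
print(ensure_trailing_newline(with_nl) == ensure_trailing_newline(without_nl))  # True
```

    The parser call would then become parser.parse(ensure_trailing_newline(text)). This changes the input rather than the grammar, so it may not suit every case.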

    To allow an optional newline at the end of the last line, you need to express that differently in the grammar, for instance:

    grammar = """
        start      : NEWLINE? mapnl* (mapnl | map)
        mapnl      : coord+ NEWLINE
        map        : coord+
        coord      : HEX | FILL
        HEX        : ("A".."F" | DIGIT)+
        FILL       : "__"
    
        %import common.DIGIT
        %import common.NEWLINE
        %import common.WS_INLINE
        %ignore WS_INLINE
    """
    

    Now the result is the expected one, whether or not the file starts and/or ends with a NEWLINE, and even if it contains empty lines.

    With this test program:

    from lark import Lark, Transformer, Token, Discard
    
    FILES = [
    # a newline at the beginning and at the end
    """
    __ 95 95 36 __ 95 __ 95 __
    __ __ 95 36 32 __ __ __ __
    """,
    
    # a newline at the end but not the beginning
    """__ 95 95 36 __ 95 __ 95 __
    __ __ 95 36 32 __ __ __ __
    """,
    
    # a newline at the beginning but not at the end
    """
    __ 95 95 36 __ 95 __ 95 __
    __ __ 95 36 32 __ __ __ __""",
    
    # no newline at the beginning nor the end
    """__ 95 95 36 __ 95 __ 95 __
    __ __ 95 36 32 __ __ __ __""",
    
    # empty lines everywhere
    """
    
    __ 95 95 36 __ 95 __ 95 __
    
    __ __ 95 36 32 __ __ __ __
    
    
    """]
    
    
    grammar = """
        start      : NEWLINE? mapnl* (mapnl | map)
        mapnl      : coord+ NEWLINE
        map        : coord+
        coord      : HEX | FILL
        HEX        : ("A".."F" | DIGIT)+
        FILL       : "__"
    
        %import common.DIGIT
        %import common.NEWLINE
        %import common.WS_INLINE
        %ignore WS_INLINE
    """
    
    class MyParser(Transformer):
    
        def start(self, nodes: list) -> list:
            return nodes
    
        def NEWLINE(self, tree):
            return Discard
    
        def coord(self, tokens: list[Token]) -> str:
            return tokens[0].value
    
    def parse(text: str):
        parser = Lark(grammar, start='start')
        tree = parser.parse(text)
        for node in MyParser().transform(tree):
            print(node)
    
    for file in FILES:
        parse(file)
        print()
    
    

    the output is:

    bruno@raspberrypi:/tmp $ python p.py 
    Tree(Token('RULE', 'mapnl'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
    Tree(Token('RULE', 'mapnl'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])
    
    Tree(Token('RULE', 'mapnl'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
    Tree(Token('RULE', 'mapnl'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])
    
    Tree(Token('RULE', 'mapnl'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
    Tree(Token('RULE', 'map'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])
    
    Tree(Token('RULE', 'mapnl'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
    Tree(Token('RULE', 'map'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])
    
    Tree(Token('RULE', 'mapnl'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
    Tree(Token('RULE', 'mapnl'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])
    
    
    bruno@raspberrypi:/tmp $ 
    

    You can also use this grammar if you prefer:

    grammar = """
        start      : NEWLINE? map* mapoptnl
        map        : coord+ NEWLINE
        mapoptnl   : coord+ NEWLINE?
        coord      : HEX | FILL
        HEX        : ("A".."F" | DIGIT)+
        FILL       : "__"
    
        %import common.DIGIT
        %import common.NEWLINE
        %import common.WS_INLINE
        %ignore WS_INLINE
    """
    

    which produces the same result, except for the rule names of course:

    bruno@raspberrypi:/tmp $ python p.py 
    Tree(Token('RULE', 'map'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
    Tree(Token('RULE', 'mapoptnl'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])
    
    Tree(Token('RULE', 'map'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
    Tree(Token('RULE', 'mapoptnl'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])
    
    Tree(Token('RULE', 'map'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
    Tree(Token('RULE', 'mapoptnl'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])
    
    Tree(Token('RULE', 'map'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
    Tree(Token('RULE', 'mapoptnl'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])
    
    Tree(Token('RULE', 'map'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
    Tree(Token('RULE', 'mapoptnl'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])
    
    bruno@raspberrypi:/tmp $