I have the contents of a file. If the final line ends with a newline (NL), I get 2 token trees as expected, but if the file doesn't have a final NL, I end up with 3 trees, which is unwanted.
How do I get lark to parse this correctly? I am at a loss as to why a NL at the end of the last line affects the grouping at the beginning of that line.
FILE = """
__ 95 95 36 __ 95 __ 95 __
__ __ 95 36 32 __ __ __ __
"""
Tree(Token('RULE', 'map'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
Tree(Token('RULE', 'map'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])
FILE = """
__ 95 95 36 __ 95 __ 95 __
__ __ 95 36 32 __ __ __ __"""
Tree(Token('RULE', 'map'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
Tree(Token('RULE', 'map'), ['__'])
Tree(Token('RULE', 'map'), ['__', '95', '36', '32', '__', '__', '__', '__'])
from lark import Lark, Transformer, Token, Discard
FILE = """
__ 95 95 36 __ 95 __ 95 __
__ __ 95 36 32 __ __ __ __
"""
grammar = """
start : NEWLINE? map+
map : [coord coord*] NEWLINE?
coord : HEX | FILL
HEX : ("A".."F" | DIGIT)+
FILL : "__"
%import common.DIGIT
%import common.NEWLINE
%import common.WS_INLINE
%ignore WS_INLINE
"""
class MyParser(Transformer):
    def start(self, nodes: list) -> list:
        return nodes

    def NEWLINE(self, tree):
        return Discard

    def coord(self, tokens: list[Token]) -> str:
        return tokens[0].value

def parse(text: str):
    parser = Lark(grammar, start='start')
    tree = parser.parse(text)
    [print(node) for node in MyParser().transform(tree)]

parse(FILE)
Preliminary comments
Running your code, I get neither two trees nor three, but four:
Tree(Token('RULE', 'map'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
Tree(Token('RULE', 'map'), ['__', '__'])
Tree(Token('RULE', 'map'), ['95'])
Tree(Token('RULE', 'map'), ['36', '32', '__', '__', '__', '__'])
I can get your two-tree result by making the NEWLINE mandatory in map, i.e. map : [coord coord*] NEWLINE rather than map : [coord coord*] NEWLINE?
Apart from that, still in map, I do not understand why you wrote [coord coord*] rather than [coord+] or even just coord+.
Your problem
Of course, the change I made does not solve the problem when the last line does not end with a NEWLINE; worse, it then produces a lark.exceptions.UnexpectedEOF.
To allow an optional newline at the end of the last line, you need to express that differently in the grammar, for instance:
grammar = """
start : NEWLINE? mapnl* (mapnl | map)
mapnl : coord+ NEWLINE
map : coord+
coord : HEX | FILL
HEX : ("A".."F" | DIGIT)+
FILL : "__"
%import common.DIGIT
%import common.NEWLINE
%import common.WS_INLINE
%ignore WS_INLINE
"""
Now the result is the expected one, whether or not the file starts and/or ends with a NEWLINE, and even if it contains empty lines.
With this test script:
from lark import Lark, Transformer, Token, Discard
FILES = [
# a newline at the beginning and at the end
"""
__ 95 95 36 __ 95 __ 95 __
__ __ 95 36 32 __ __ __ __
""",
# a newline at the end but not the beginning
"""__ 95 95 36 __ 95 __ 95 __
__ __ 95 36 32 __ __ __ __
""",
# a newline at the beginning but not at the end
"""
__ 95 95 36 __ 95 __ 95 __
__ __ 95 36 32 __ __ __ __""",
# no newline at the beginning nor the end
"""__ 95 95 36 __ 95 __ 95 __
__ __ 95 36 32 __ __ __ __""",
# empty lines everywhere
"""

__ 95 95 36 __ 95 __ 95 __

__ __ 95 36 32 __ __ __ __

"""]
grammar = """
start : NEWLINE? mapnl* (mapnl | map)
mapnl : coord+ NEWLINE
map : coord+
coord : HEX | FILL
HEX : ("A".."F" | DIGIT)+
FILL : "__"
%import common.DIGIT
%import common.NEWLINE
%import common.WS_INLINE
%ignore WS_INLINE
"""
class MyParser(Transformer):
    def start(self, nodes: list) -> list:
        return nodes

    def NEWLINE(self, tree):
        return Discard

    def coord(self, tokens: list[Token]) -> str:
        return tokens[0].value

def parse(text: str):
    parser = Lark(grammar, start='start')
    tree = parser.parse(text)
    [print(node) for node in MyParser().transform(tree)]

for file in FILES:
    parse(file)
    print()
The output is:
bruno@raspberrypi:/tmp $ python p.py
Tree(Token('RULE', 'mapnl'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
Tree(Token('RULE', 'mapnl'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])
Tree(Token('RULE', 'mapnl'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
Tree(Token('RULE', 'mapnl'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])
Tree(Token('RULE', 'mapnl'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
Tree(Token('RULE', 'map'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])
Tree(Token('RULE', 'mapnl'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
Tree(Token('RULE', 'map'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])
Tree(Token('RULE', 'mapnl'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
Tree(Token('RULE', 'mapnl'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])
bruno@raspberrypi:/tmp $
You can also use this grammar if you prefer:
grammar = """
start : NEWLINE? map* mapoptnl
map : coord+ NEWLINE
mapoptnl : coord+ NEWLINE?
coord : HEX | FILL
HEX : ("A".."F" | DIGIT)+
FILL : "__"
%import common.DIGIT
%import common.NEWLINE
%import common.WS_INLINE
%ignore WS_INLINE
"""
This produces the same result, apart from the rule names of course:
bruno@raspberrypi:/tmp $ python p.py
Tree(Token('RULE', 'map'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
Tree(Token('RULE', 'mapoptnl'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])
Tree(Token('RULE', 'map'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
Tree(Token('RULE', 'mapoptnl'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])
Tree(Token('RULE', 'map'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
Tree(Token('RULE', 'mapoptnl'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])
Tree(Token('RULE', 'map'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
Tree(Token('RULE', 'mapoptnl'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])
Tree(Token('RULE', 'map'), ['__', '95', '95', '36', '__', '95', '__', '95', '__'])
Tree(Token('RULE', 'mapoptnl'), ['__', '__', '95', '36', '32', '__', '__', '__', '__'])
bruno@raspberrypi:/tmp $