I have strings with words like this:
I made some rules to parse it and with larks Early-parser it works fine. Now i want to try it out with the Lalr-Parser but then the special words are not recognized. I am interested in the capitalized letter combinations and the numbers. I have a list of the possible letter combinations. The numbers are always two digits. The rest of the string can be anything.
I am using lark. This is my parser:
from lark import Lark
tryoutabc = Lark(r"""
start: mat_all_and_restl
?mat_all_and_restl: mat_all restl | restl | restl mat_br
restl: REST*
?mat_all: mat_br| mat_num| num
mat_num: mat_br ("-")? NUM
?num: NUM ("ZZ")?
NUM: /\d{2}/
?mat_br: "("? MAT ")"?
MAT: "ABC"
| "HFG"
| "ZT"
| "DFG"
REST: /([\-])?[A-Za-zÜ]+([\/.][A-Za-zÜ]+)?/
%import common.WS
%ignore WS
""", start='start', parser='lalr') #remove "parser='lalr'" and the results are right
How do i have to change the rules to make it possible with the lalr-parser to parse this?
To try it out:
textl = ["ABC SOME OTHER STUFF", "Some other stuff", "HFG 54 Some other stuff and even more",
"HFG 54 ZZ Some other stuff and even more", "HFG-54 Some other stuff and even more", "ZT Some other stuff",
"(ZT) Some other stuff", "Some other stuff DFG", "Some other stuff (DFG)", "68 Some other stuff "]
for text in textl:
print(tryoutabc.parse(text).pretty())
I get
start
restl
ABC
SOME
OTHER
STUFF
but i want
start
ABC
restl
SOME
OTHER
STUFF
For "HFG 54 Some other stuff and even more"
i get an error:
HFG 54 Some other stuff and even more
^
Expecting: {'REST'}`
but i want:
start
mat_num
HFG
54
restl
Some
other
stuff
and
even
more
In reality the strings are longer and look like this "Interesting stuff i already parsed ABC Some other stuff" and i already parsed the stuff at the beginning of the strings and it works pefect.
From the comments it appears that it is not possible to do that because i dont have a context free language here and apparently larl can only do cfg-languages. If somebody adds and answer where this both is explained quick i would be happy.
The problem is the REST
terminal. It can parse almost anything by design, meaning that
start
mat_all_and_restl
ABC
restl
SOME
OTHER
STUFF
and
start
restl
ABC
SOME
OTHER
STUFF
Are both valid interpretations of the first example sentences (ABC SOME OTHER STUFF
). This is something you can take a look at if you pass ambiguity='explicit'
to an earley parser.
This example works with lalr
, since there isn't a problem if lalr
just interprets everything as REST
, but it does give you the wrong result.
The problematic example "HFG 54 Some other stuff and even more"
on the other hand doesn't work. It takes HFG
as REST
, and then tries to parse 54
which it can since it isn't REST
and it doesn't expect a NUM
right now.
If this is the full extended of you grammar, use earley
, or for even more speed than lalr could provided, a (or multiple) regex might also do the job quite well.
(Note, I misspoke: this is potentially a CFG, just ambiguous. You could probably continue to hack on the grammar/the REST
regex to make it work, but it won't be pretty.)