I would like to extract only certain structured patterns from a text file.
For example, in the text below:
blablabla
foo FUNC1 ; blabliblo blu
I would like to isolate only 'foo FUNC1 ;'.
I tried using the Lark parser with the following grammar:
from lark import Lark

foo = Lark('''
start: statement*
statement: foo
         | anything
anything : /.+/
foo : "foo" ID ";"
ID : /_?[a-z][_a-z0-9]*/i
%import common.WS
%import common.NEWLINE
%ignore WS
%ignore NEWLINE
''',
    parser="lalr",
    propagate_positions=True)
But 'anything' captures everything. Is there a way to make it non-greedy, so that 'foo' can capture the given pattern?
You could solve this with priorities.
For parser="lalr", Lark supports priorities on terminals. So you could move "foo" into its own terminal and then assign that terminal a higher priority than the anything terminal (which keeps the default priority: 1 in older Lark releases, 0 since Lark 1.0):
foo : FOO ID ";"
FOO.2: "foo"
Parsing your example string then results in:
start
  statement
    anything    blablabla
  statement
    foo
      foo
      FUNC1
  statement
    anything    blabliblo blu
For parser="earley", Lark supports priorities on rules, so you could use:
foo.2 : "foo" ID ";"
Parsing your example string then results in:
start
  statement
    anything    blablabla
  statement
    foo    FUNC1
  statement
    anything    blabliblo blu