I'm trying to parse data in the following format:
data = """\
map=1
  sub=1
    int=99
    foo=bar
  sub=2
    foo=bar
    int=99
    bar=qux
"""
I based my grammar on the example from the pyparsing repository and this is what I got:
from pyparsing import *
stmt = Forward()
suite = IndentedBlock(stmt)
identifier = Word(alphas, alphanums)
key = Combine(identifier + "=" + Word(nums))
definition = key + suite
rhs = Regex(r"[a-z0-9]+")
lhs = identifier + Suppress("=") + rhs
stmt << (definition | lhs)
body = ZeroOrMore(stmt)
# run it
tree = body.parse_string(data)
print(tree)
The resulting parse-tree is almost correct:
['map=1', ['sub=1', ['int=99', ['foo', 'bar']], 'sub=2', ['foo', 'bar', 'int=99', ['bar', 'qux']]]]
Since the map/submap keys contain an integer index postfix, e.g. =1, they are syntactically the same as integer assignments. This causes the parser to treat key-value pairs after an integer assignment as a new indented block.
The desired outcome, however, is something like
['map=1', ['sub=1', ['int', '99', 'foo', 'bar']], ['sub=2', ['foo', 'bar', 'int', '99', 'bar', 'qux']]]
How can I eliminate this ambiguity? Some kind of negative lookahead, perhaps?
Syntaxes that have significant whitespace for indentation have always been a challenge for pyparsing, since its default behavior is to ignore any intervening whitespace between tokens and sub-expressions. I've had several attempts at defining a macro for parsing indented lines such as yours, and the latest is a class, IndentedBlock. Here is a stab at parsing your data using IndentedBlock:
data = """\
map=1
  sub=1
    int=99
    foo=bar
  sub=2
    foo=bar
    int=99
    bar=qux
"""
import pyparsing as pp
ppc = pp.common
from pprint import pprint
EQ = pp.Suppress('=')
key = pp.Word(pp.alphas, pp.alphanums)
value = ppc.number | pp.Word(pp.alphanums)
key_value = pp.Group(key("key") + EQ + value("value"))
structure = pp.IndentedBlock(key_value, recursive=True, grouped=True)
parsed = structure.parse_string(data)
pprint(parsed.as_list(), width=32)
Prints:
[[['map', 1],
  [['sub', 1],
   [['int', 99],
    ['foo', 'bar']],
   ['sub', 2],
   [['foo', 'bar'],
    ['int', 99],
    ['bar', 'qux']]]]]
Which I think is fairly close to what you are striving for. I chose to wrap each key=value part in its own group, along with results names "key" and "value" to access each group's parts by name instead of by numeric 0 or 1 index.
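To make the difference between index and name access concrete, here is a minimal sketch that reuses the key_value expression from above on a single line of input. The variable names pair, by_index and by_name are only illustrative, and it assumes pyparsing 3.x (for parse_string and pp.common).

```python
import pyparsing as pp

EQ = pp.Suppress("=")
key = pp.Word(pp.alphas, pp.alphanums)
value = pp.common.number | pp.Word(pp.alphanums)
key_value = pp.Group(key("key") + EQ + value("value"))

# Parse one key=value line on its own; element 0 is the Group for that pair.
pair = key_value.parse_string("int=99")[0]

# The same two tokens, reached by position and by results name.
by_index = (pair[0], pair[1])
by_name = (pair["key"], pair["value"])
print(by_index, by_name)

# The named results can also be pulled out as a plain dict.
print(pair.as_dict())
```

Note that pp.common.number converts numeric strings to Python numbers at parse time, so the value here comes back as the int 99 rather than the string '99'; non-numeric values such as bar fall through to the Word(alphanums) alternative and stay strings.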