Tags: python, parsing, indentation, pyparsing

How do I parse ambiguous indented blocks like this using pyparsing?


I'm trying to parse data in the following format:

data = """\
map=1
  sub=1
    int=99
    foo=bar
  sub=2
    foo=bar
    int=99
    bar=qux
"""

I based my grammar on the example from the pyparsing repository and this is what I got:

from pyparsing import *


stmt = Forward()
suite = IndentedBlock(stmt)
identifier = Word(alphas, alphanums)
key = Combine(identifier + "=" + Word(nums))
definition = key + suite

rhs = Regex(r"[a-z0-9]+")
lhs = identifier + Suppress("=") + rhs
stmt << (definition | lhs)

body = ZeroOrMore(stmt)

# run it

tree = body.parse_string(data)
print(tree)

The resulting parse-tree is almost correct:

['map=1', ['sub=1', ['int=99', ['foo', 'bar']], 'sub=2', ['foo', 'bar', 'int=99', ['bar', 'qux']]]]

Since the map/submap keys carry an integer index suffix (e.g. =1), they are syntactically identical to integer assignments such as int=99. As a result, the parser treats the key-value pairs that follow an integer assignment as a new indented block.
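The overlap can be seen in isolation with a minimal sketch that reuses the `identifier` and `key` expressions from the grammar above. The helper `matches_key` and the sample strings are only illustrative and not part of the original grammar. `key` accepts the assignment `int=99` just as readily as the block header `sub=1`, and only rejects a line whose right-hand side is not numeric.

```python
from pyparsing import Combine, ParseException, Word, alphas, alphanums, nums

# Same definitions as in the grammar above.
identifier = Word(alphas, alphanums)
key = Combine(identifier + "=" + Word(nums))


def matches_key(line):
    """Hypothetical helper: True if `line` parses completely as a block-header key."""
    try:
        key.parse_string(line, parse_all=True)
        return True
    except ParseException:
        return False


header_ok = matches_key("sub=1")          # a real block header
int_assignment_ok = matches_key("int=99")  # a plain integer assignment
str_assignment_ok = matches_key("foo=bar")  # a plain string assignment

print(header_ok, int_assignment_ok, str_assignment_ok)
```

Because `definition` is tried before `lhs` in `stmt`, any line for which `key` succeeds is treated as the start of a block whenever an indented suite can be matched after it.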

The desired outcome, however, is something like

['map=1', ['sub=1', ['int', '99', 'foo', 'bar'], ['sub=2', ['foo', 'bar', 'int', '99', 'bar', 'qux']]

How can I eliminate this ambiguity? Some kind of negative lookahead, perhaps?


Solution

  • Syntaxes with significant whitespace for indentation have always been a challenge for pyparsing, since its default behavior is to skip any intervening whitespace between tokens and sub-expressions. I've made several attempts at defining a macro for parsing indented lines such as yours, and the latest is a class, IndentedBlock. Here is a stab at parsing your data using IndentedBlock:

    data = """\
    map=1
      sub=1
        int=99
        foo=bar
      sub=2
        foo=bar
        int=99
        bar=qux
    """
    
    import pyparsing as pp
    ppc = pp.common
    from pprint import pprint
    
    
    EQ = pp.Suppress('=')
    key = pp.Word(pp.alphas, pp.alphanums)
    value = ppc.number | pp.Word(pp.alphanums)
    
    key_value = pp.Group(key("key") + EQ + value("value"))
    
    
    structure = pp.IndentedBlock(key_value, recursive=True, grouped=True)
    parsed = structure.parse_string(data)
    
    pprint(parsed.as_list(), width=32)
    

    Prints:

    [[['map', 1],
      [['sub', 1],
       [['int', 99],
        ['foo', 'bar']],
       ['sub', 2],
       [['foo', 'bar'],
        ['int', 99],
        ['bar', 'qux']]]]]
    

    I think this is fairly close to what you are striving for. I chose to wrap each key=value part in its own group and to give the parts the results names "key" and "value", so that each group's parts can be accessed by name instead of by numeric index 0 or 1.
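As a small illustration of the named access, the sketch below applies the same `key_value` expression to single lines and reads the parts by name. The sample strings and the variable names `int_pair` and `str_pair` are made up for the example.

```python
import pyparsing as pp

ppc = pp.common

# Same expressions as in the answer above.
EQ = pp.Suppress("=")
key = pp.Word(pp.alphas, pp.alphanums)
value = ppc.number | pp.Word(pp.alphanums)
key_value = pp.Group(key("key") + EQ + value("value"))

# parse_string returns a ParseResults; element 0 is the Group for the pair.
int_pair = key_value.parse_string("int=99")[0]
str_pair = key_value.parse_string("foo=bar")[0]

# Access by results name, either as an attribute or dict-style.
print(int_pair.key, int_pair.value)
print(str_pair["key"], str_pair["value"])

# Positional access still works on the same group.
print(int_pair[0], int_pair[1])
```

Numeric values come back already converted by `ppc.number`, so `int_pair["value"]` is the integer 99 and not the string "99".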