[SOLVED] PEG grammar ordered choice failure

PEG grammar ordered choice failure

I have a PEG grammar for a toy DSL using the Python Arpeggio package:

from arpeggio.cleanpeg import ParserPEG

grammar = """
    root    = block* EOF
    block   = header (item1+ / item2+)
    header  = "block"
    item1   = number name comment?
    item2   = number name list comment?
    number  = r"\d+"
    name    = r"\w+"
    list    = r"\[.*\]"
    comment = r"\/\/.*"
"""

doc = """
block
  5 alpha []        //
  3 beta [a, b, c]  // this is an item2

block
  6 foo
  1 bar  // This is an item1
  4 baz  // more stuff
"""

parser = ParserPEG(grammar, 'root', debug=True)
parse_tree = parser.parse(doc)
print ('Tree:', parse_tree)

This gives strange results when parsing the test document: it correctly fails to match item1 in the ordered choice, but then falsely claims (the line marked with xxxx) that it did match the choice without ever testing item2, which would have matched.

>> Matching rule root=Sequence at position 0 => * block   5
   >> Matching rule ZeroOrMore in root at position 0 => * block   5
      >> Matching rule block=Sequence in root at position 0 => * block   5
         ?? Try match rule header=StrMatch(block) in block at position 1 =>  *block   5 
         ++ Match 'block' at 1 => ' *block*   5 '
         >> Matching rule OrderedChoice in block at position 6 =>  block*   5 alpha
            >> Matching rule OneOrMore in block at position 6 =>  block*   5 alpha
               >> Matching rule item1=Sequence in block at position 6 =>  block*   5 alpha
                  ?? Try match rule number=RegExMatch(\d+) in item1 at position 9 =>  block   *5 alpha []
                  ++ Match '5' at 9 => ' block   *5* alpha []'
                  ?? Try match rule name=RegExMatch(\w+) in item1 at position 11 => block   5 *alpha []  
                  ++ Match 'alpha' at 11 => 'block   5 *alpha* []  '
                  >> Matching rule Optional in item1 at position 16 =>    5 alpha* []       
                     ?? Try match rule comment=RegExMatch(\/\/.*) in item1 at position 17 =>   5 alpha *[]        
                     -- NoMatch at 17
                  <<- Not matched rule Optional in item1 at position 16 =>    5 alpha* []       
               <<+ Matched rule item1=Sequence in item1 at position 16 =>    5 alpha* []       
               >> Matching rule item1=Sequence in block at position 16 =>    5 alpha* []       
                  ?? Try match rule number=RegExMatch(\d+) in item1 at position 17 =>   5 alpha *[]        
                  -- NoMatch at 17
               <<- Not matched rule item1=Sequence in item1 at position 16 =>    5 alpha* []
xxxx-->     <<+ Matched rule OneOrMore in block at position 16 =>    5 alpha* []       
         <<+ Matched rule OrderedChoice in block at position 16 =>    5 alpha* []       
      <<+ Matched rule block=Sequence in block at position 16 =>    5 alpha* []       
      >> Matching rule block=Sequence in root at position 16 =>    5 alpha* []       
         ?? Try match rule header=StrMatch(block) in block at position 17 =>   5 alpha *[]        
         -- No match 'block' at 17 => '  5 alpha *[]   *     '
      <<- Not matched rule block=Sequence in block at position 16 =>    5 alpha* []       
   <<+ Matched rule ZeroOrMore in root at position 16 =>    5 alpha* []       
   ?? Try match rule EOF in root at position 17 =>   5 alpha *[]        
   !! EOF not matched.
<<- Not matched rule root=Sequence in root at position 0 => * block   5

As a result, the parser failed to consume the item2s that it would have consumed had it actually matched item2 after item1 failed.

Is this a bug in the parser package, or in my grammar?

Note that reversing the ordered choice:

    block   = header (item2+ / item1+)

correctly parses the example document. But an anomalous result that is easy to work around in a toy problem may be much more difficult to find in real grammars. An item is unambiguously either an item1 or item2, so the order in which they are checked should be irrelevant, and the code for parsing them should work consistently.

Solution

In PEG the order of expressions is OrderedChoice is important. When the parser try item1+ it is enough to match at least one of item1 to succeed and the whole ordered choice is then considered successful.

In general, always put more specific matches at the beginning and more general towards the end of ordered choice.

Update: there is a nice explanation in Ambiguity detection and influence of rule order on language that is matched section on Wikipedia.