pyparsing

pyparsing fails to find a complete parse in presence of an ambiguous grammar?


I am trying to apply pyparsing to solve the Advent of Code 2015 Day 19 puzzle.

The simplest example shows an ambiguous grammar e => H; e => O; H => HO; H => OH; O => HH and seeks the number of replacements to parse HOH from the start symbol e. One such solution is e -> O -> HH -> HOH.

I attempted the following (which doesn't yet count the number of replacements):

  import pyparsing as pp
  pp.ParserElement.enable_left_recursion()

  sym_H = pp.Forward()
  sym_O = pp.Forward()

  sym_e = sym_H ^ sym_O
  sym_H <<= (sym_H + sym_O) ^ (sym_O + sym_H) ^ 'H'
  sym_O <<= (sym_H + sym_H) ^ 'O'

  # This should parse the entire string using e -> O -> HH -> HOH  but fails?
  print(sym_e.parse_string('HOH'))  # Shows: ['H', 'O'].

This code uses the Or (^) operator rather than the usual MatchFirst (|) because the documentation explains that it should find a longest possible match.

Is there intuition for why parse_string cannot find a (non-unique) expansion that matches the full string? Is there perhaps a simple code modification to make it work?

Someone successfully used CYK parsing on this puzzle; is that parsing algorithm more general (but slower) than the one in pyparsing?


Solution

  • Interesting problem, and it touches on a feature in pyparsing that is still relatively new to me, the support for left-recursion.

    I made some essentially cosmetic changes to your grammar, to set up support for naming the elements, which helps in diagramming and debugging.

    O = pp.Literal("O")
    H = pp.Literal("H")
    sym_e = sym_H ^ sym_O
    sym_H <<= (sym_H + sym_O) ^ (sym_O + sym_H) ^ H
    sym_O <<= (sym_H + sym_H) ^ O
    

    Then adding these two lines defines names for all the expression variables, and sets the debug flag for them.

    pp.enable_diag(
        pp.Diagnostics.enable_debug_on_named_expressions
    )
    pp.autoname_elements()
    

    Before running to get the debug output, I also called create_diagram() to see if there were any insights that might give:

    parser railroad diagram

    The debug output shows the progressive attempts at parsing the individual sub-expressions, and ending with:

    [... much output clipped ...]
    Match sym_O at loc 0(1,1)
      HOH
      ^
    Match sym_O failed, ParseException raised: Forward recursion without base case, found 'HOH'  (at char 0), (line:1, col:1)
    Match H at loc 0(1,1)
      HOH
      ^
    Matched H -> ['H']
    Match H at loc 0(1,1)
      HOH
      ^
    Matched H -> ['H']
    Matched sym_H -> ['H', 'O']
    Matched sym_e -> ['H', 'O']
    ['H', 'O']
    

    On a hunch, I reversed the order of the two expressions for sym_e:

    sym_e = sym_O ^ sym_H
    

    And this time, the debug output ended with:

    [... much output clipped ...]
    Match O failed, ParseException raised: Expected O, found 'HOH'  (at char 0), (line:1, col:1)
    Match sym_H at loc 0(1,1)
      HOH
      ^
    Matched sym_H -> ['H', 'O']
    Match sym_H at loc 2(1,3)
      HOH
        ^
    Matched sym_H -> ['H']
    Matched sym_O -> ['H', 'O', 'H']
    Matched sym_e -> ['H', 'O', 'H']
    ['H', 'O', 'H']
    

    A successful match!

    So I think the left-recursion logic may have issues in the face of expressions using the '^' operator, that it might be giving up a little too soon. Or that it is not smart enough to undo an initial sym_H match of the leading "HO" leaving the trailing H unparsed, to back up and choose a shorter sym_H match of the leading "H", so that the following "OH" would match the second sym_H needed for a matched sym_O.