I am trying to apply pyparsing
to solve the Advent of Code 2015 Day 19 puzzle.
The simplest example shows an ambiguous grammar e => H; e => O; H => HO; H => OH; O => HH
and seeks the number of replacements to parse HOH
from the start symbol e
. One such solution is e -> O -> HH -> HOH
.
I attempted the following (which doesn't yet count the number of replacements):
import pyparsing as pp
pp.ParserElement.enable_left_recursion()
sym_H = pp.Forward()
sym_O = pp.Forward()
sym_e = sym_H ^ sym_O
sym_H <<= (sym_H + sym_O) ^ (sym_O + sym_H) ^ 'H'
sym_O <<= (sym_H + sym_H) ^ 'O'
# This should parse the entire string using e -> O -> HH -> HOH but fails?
print(sym_e.parse_string('HOH')) # Shows: ['H', 'O'].
This code uses the Or
(^
) operator rather than the usual MatchFirst
(|
) because the documentation explains that it should find a longest possible match.
Is there intuition for why parse_string
cannot find a (non-unique) expansion that matches the full string? Is there perhaps a simple code modification to make it work?
Someone successfully used CYK parsing on this puzzle; is that parsing algorithm more general (but slower) than the one in pyparsing
?
Interesting problem, and it touches on a feature in pyparsing that is still relatively new to me, the support for left-recursion.
I made some essentially cosmetic changes to your grammar, to set up support for naming the elements, which helps in diagramming and debugging.
O = pp.Literal("O")
H = pp.Literal("H")
sym_e = sym_H ^ sym_O
sym_H <<= (sym_H + sym_O) ^ (sym_O + sym_H) ^ H
sym_O <<= (sym_H + sym_H) ^ O
Then adding these two lines defines names for all the expression variables, and sets the debug flag for them.
pp.enable_diag(
pp.Diagnostics.enable_debug_on_named_expressions
)
pp.autoname_elements()
Before running to get the debug output, I also called create_diagram() to see if there were any insights that might give:
The debug output shows the progressive attempts at parsing the individual sub-expressions, and ending with:
[... much output clipped ...]
Match sym_O at loc 0(1,1)
HOH
^
Match sym_O failed, ParseException raised: Forward recursion without base case, found 'HOH' (at char 0), (line:1, col:1)
Match H at loc 0(1,1)
HOH
^
Matched H -> ['H']
Match H at loc 0(1,1)
HOH
^
Matched H -> ['H']
Matched sym_H -> ['H', 'O']
Matched sym_e -> ['H', 'O']
['H', 'O']
On a hunch, I reversed the order of the two expressions for sym_e
:
sym_e = sym_O ^ sym_H
And this time, the debug output ended with:
[... much output clipped ...]
Match O failed, ParseException raised: Expected O, found 'HOH' (at char 0), (line:1, col:1)
Match sym_H at loc 0(1,1)
HOH
^
Matched sym_H -> ['H', 'O']
Match sym_H at loc 2(1,3)
HOH
^
Matched sym_H -> ['H']
Matched sym_O -> ['H', 'O', 'H']
Matched sym_e -> ['H', 'O', 'H']
['H', 'O', 'H']
A successful match!
So I think the left-recursion logic may have issues in the face of expressions using the '^' operator, that it might be giving up a little too soon. Or that it is not smart enough to undo an initial sym_H
match of the leading "HO" leaving the trailing H unparsed, to back up and choose a shorter sym_H
match of the leading "H", so that the following "OH" would match the second sym_H
needed for a matched sym_O
.