I'm writing a parser-combinator library in JS, able to express and evaluate EBNF-style CFGs (grammars), such as can be validated/verified here.
For example, the EBNF grammar S := ("a" | ("b" "c"?)*) "d"
, matching inputs like "ad"
or "bcbd"
, could be run in JS somewhat like this:
grammar = and(
or(
terminal("a"),
any(
and(
terminal("b"),
optional(terminal("c"))
)
)
),
terminal("d")
);
grammar.match("ad"); // true
grammar.match("bcbd"); // true
grammar.match("abcd"); // false
The combinators currently don't do any backtracking, and are eager/short-circuited. So the or(..)
alternation combinator simply stops after it finds the first matching production passed to it.
Thus, this library currently fails with this grammar (S := "a" ("b" | ("b" "c")) "d"
) on "abcd"
input, even though it should match. However, it works fine with this grammar (S := "a" (("b" "c") | "b") "d"
), because the alternation was conveniently re-ordered.
As I was contemplating how to implement backtracking for the library to fix this issue, a simple "greedy" algorithm (for alternation) came to mind: instead of short-circuiting the or(..)
, try all productions (keeping track of the length of all matched results), and then choose the first one with the longest match.
With such an algorithm, the "abcd"
would indeed match, since "bc"
is a longer match than "b"
in the alternation.
For clarity sake, the library only needs to determine if there's any parse that fully matches the input string. I'm not concerned with grammar ambiguity that could produce two or more distinct parse trees. Left-recursion is still a concern, but that's a separate topic to tackle.
The imagined greedy-OR approach seems plausible to implement. But before I go down that route, I'm worried that this is maybe too naive, and doesn't represent other various kinds of backtracking these combinators would need.
So... finally my question: Is there another EBNF grammar + input example that would still fail to match with the above described greedy-OR algorithm, where a more generalized backtracking would be required? IOW, what is the minimal/limited backtracking (versus unbounded) I actually need to implement here?
Each time I try to google for, or conceive of, a grammar (with alternation, repetition, etc) that might fail, I think the greedy-OR approach "fixes" it. But I readily admit that I'm potentially just missing some obvious gotchas (maybe with repetition and optional?).
Actually, I just found an example where the naive greedy-OR breaks, and thus more general backtracking is required:
S := ("ab" | "a") "b";
The input "ab"
should match, but it wouldn't: greedy-OR takes the "ab"
instead of just "a"
, meaning that a final "b"
in the input isn't available to satisfy the remainder of the S
production.
I dunno why I couldn't figure out such a simple example until after I wrote out the whole question above; must be the rubber-ducky effect! In any case, I think this is clear proof that the greedy-OR is necessary but insufficient.
And of course, now I also see how repetition alone (with no alternation at all) leads to the need for backtracking:
S := "a"* "a";
If *
/ any()
is greedy (it is!), then this production will never match a string of "aaaaa"
without backtracking included.