regexparsinglanguage-agnosticcontext-free-grammarpeg

What are the differences between PEGs and CFGs?


From this wikipedia page:

The fundamental difference between context-free grammars and parsing expression grammars is that the PEG's choice operator is ordered. If the first alternative succeeds, the second alternative is ignored. Thus ordered choice is not commutative, unlike unordered choice as in context-free grammars and regular expressions. Ordered choice is analogous to soft cut operators available in some logic programming languages.

Why does PEG's choice operator short circuits the matching? Is it because to minimize memory usage (due to memoization)?

I'm not sure what the choice operator is in regular expressions but let's suppose it is this: /[aeiou]/ to match a vowel. So this regex is commutative because I could have written it in any of the 5! (five factorial) permutations of the vowel characters? i.e. /[aeiou]/ behaves the same as /[eiaou]/. What is the advantage of it being commutative? (c.f. PEG's non-commutativity)

The consequence is that if a CFG is transliterated directly to a PEG, any ambiguity in the former is resolved by deterministically picking one parse tree from the possible parses. By carefully choosing the order in which the grammar alternatives are specified, a programmer has a great deal of control over which parse tree is selected.

Is this saying that PEG's grammar is superior to CFG's?


Solution

  • A CFG grammar is non-deterministic, meaning that some input could result in two or more possible parse-trees. Though most CFG-based parser-generators have restrictions on the determinability of the grammar. It will give a warning or error if it has two or more choices.

    A PEG grammar is deterministic, meaning that any input can only be parsed one way.

    To take a classic example; The grammar

    if_statement := "if" "(" expr ")" statement "else" statement
                  | "if" "(" expr ")" statement;
    

    applied to the input

    if (x1) if (x2) y1 else y2
    

    could either be parsed as

    if_statement(x1, if_statement(x2, y1, y2))
    

    or

    if_statement(x1, if_statement(x2, y1), y2)
    

    A CFG-parser would generate a Shift/Reduce-conflict, since it can't decide if it should shift (read another token), or reduce (complete the node), when reaching the "else" keyword. Of course, there are ways to get around this problem.

    A PEG-parser would always pick the first choice.

    Which one is better is for you to decide. My opinion is that often PEG-grammars is easier to write, and CFG grammars easier to analyze.