functional-programmingbnfparser-combinatorsleft-recursion

Parser combinator for propositional logic


I'd like to design a combinator to parse Propositional Logic. Here's a simple BNF:

<sentence> ::= <atomic-sentence> | <complex-sentence>
<atomic-sentence> ::= True | False | P | Q | R
<complex-sentence> ::= (<sentence>)
                       | <sentence> <connective> <sentence>
                       | ¬<sentence>
<connective> ::= ∧ | ∨ | ⇒ | ⇔

The problem is that the grammar is left-recursive, which leads to an infinite loop: a sentence can be a complex sentence, which can start with a sentence, which can be a complex sentence, ... forever. Here's an example sentence that causes this problem:

P∧Q

Is there a simple way to fix the grammar so that it is suitable for a parser combinator? Thanks.

FWIW, I'm using FParsec in F#, but I think any parser combinator library would have the same issue.


Solution

  • FParsec can handle infix operators using the OperatorPrecedenceParser class where you just need to specify which operators with which associativity and precedence you have, without having to actually write a grammar for your infix expressions. The rest of this answer will explain how to solve the problem without this class for cases where the class doesn't apply, for parser combinators that don't have an equivalent class or in case you just plain don't want to use it or are at least interested in how you'd solve the problem without it.

    Parser combinators tend to not support left-recursion, but they do tend to support repetition. Luckily, a left-recursive rule of the form <a> ::= <a> <b> | <c> can be rewritten using the * repetition operator to <a> ::= <c> <b>*. If you then left-fold over the resulting list, you can construct a tree that looks just like the parse tree you'd have gotten from the original grammar.

    So if we first inline <complex-sentence> into <sentence> and then apply the above pattern, we get <a> = <sentence>, <b> = <connective> <sentence> and <c> = <atomic-sentence> | '(' <sentence> ')' | ¬<sentence>, resulting in the following rule after the transformation:

    <sentence> ::= ( <atomic-sentence>
                   | '(' <sentence> ')'
                   | ¬<sentence>
                   )* <connective> <sentence>
    

    To improve readability, we'll put the parenthesized part into its own rule:

    <operand>  ::= <atomic-sentence>
                 | '(' <sentence ')'
                 | ¬<sentence>
    <sentence> ::= <operand> (<connective> <sentence>)*
    

    Now if you try this grammar, you'll notice something strange: the list created by the * will only ever contain a single element (or none). This because if there's more than two operands, the right-recursive call to <sentence> will eat up all the operands, creating a right-associative parse tree.

    So really the above grammar is equivalent to this (or rather the grammar is ambiguous, but a parser combinator will treat it as if it were equivalent to this):

    <sentence> ::= <operand> <connective> <sentence>
    

    This happened because the original grammar was ambiguous. The ambiguous definition <s> ::= <s> <c> <s> | <o> can either be interpreted as the left-recursive <s> ::= <s> <c> <o> | <o> (which will create a left-associative parse tree) or the right-recursive <s> ::= <o> <c> <s> | <o> (right-associative parse tree). So we should first remove the ambiguity by choosing one of those forms and then apply the transformation if applicable.

    So if we choose the left-recursive form, we end up with:

    <sentence> ::= <operand> (<connective> <operand>)*
    

    Which will indeed create lists with more than one element. Alternatively if we choose the right-recursive rule, we can just leave it as-is (no repetition operator necessary) as there is no left-recursion to eliminate.

    As I said, we can now get a left-associative tree by taking the list from the left-recursive version and left-folding it or a right-associative one by taking the right-recursive version. However both of these options will leave us with a tree that treats all of the operators as having the same precedence.

    To fix the precedence you can either apply something like the shunting yard algorithm to the list or you can first re-write the grammar to take precedence into account and then apply the transformation.