parsinggrammarparenthesesrecursive-descentambiguous-grammar

Parser/Grammar: 2x parenthesis in nested rules


Despite my limited knowledge about compiling/parsing I dared to build a small recursive-descent parser for OData $filter expressions. The parser only needs to check the expression for correctness and output a corresponding condition in SQL. As input and output have almost the same tokens and structure, this was fairly straightforward, and my implementation does 90% of what I want.

But now I got stuck with parentheses, which appear in separate rules for logical and arithmetic expressions. The full OData grammar in ABNF is here, a condensed version of the rules involved is this:

boolCommonExpr = ( boolMethodCallExpr 
                 / notExpr  
                 / commonExpr [ eqExpr / neExpr / ltExpr / ... ]
                 / boolParenExpr
                 ) [ andExpr / orExpr ] 
commonExpr = ( primitiveLiteral
             / firstMemberExpr  ; = identifier
             / methodCallExpr 
             / parenExpr 
             ) [ addExpr / subExpr / mulExpr / divExpr / modExpr ]  
boolParenExpr = "(" boolCommonExpr ")"
parenExpr     = "(" commonExpr ")"

How does this grammar match a simple expression like (1 eq 2)? From what I can see all ( are consumed by the rule parenExpr inside commonExpr, i.e. they must also close after commonExpr to not cause an error and boolParenExpr never gets hit. I suppose my experience / intuition on reading such a grammar is just insufficient to get it. A comment in the ABNF says: "Note that boolCommonExpr is also a commonExpr". Maybe that's part of the mystery?

Obviously an opening ( alone won't tell me where it's going to close: After the current commonExpr expression or further away after boolCommonExpr. My lexer has a list of all tokens ahead (URL is very short input). I was thinking to use that to find out what type of ( I have. Good idea?

I'd rather have restrictions in input or a little hack than switching to a generally more powerful parser model. For a simple expression translation like this I also want to avoid compiler tools.


Edit 1: Extension after answer by rici - Is grammar rewrite correct?

Actually I started out with the example for recursive-descent parsers given on Wikipedia. Then I though to better adapt to the official grammar given by the OData standard to be more "conformant". But with the advice from rici (and the comment from "Internal Server Error") to rewrite the grammar I would tend to go back to the more comprehensible structure provided on Wikipedia.

Adapted to the boolean expression for the OData $filter this could maybe look like this:

boolSequence= boolExpr {("and"|"or") boolExpr} .
boolExpr    = ["not"] expression ("eq"|"ne"|"lt"|"gt"|"lt"|"le") expression .
expression  = term {("add"|"sum") term} .
term        = factor {("mul"|"div"|"mod") factor} .
factor      = IDENT | methodCall | LITERAL | "(" boolSequence")" .
methodCall  = METHODNAME "(" [ expression {"," expression} ] ")" .

Does the above make sense in general for boolean expressions, is it mostly equivalent to the original structure above and digestible for a recursive descent parser?

@rici: Thanks for your detailed remarks on type checking. The new grammar should resolve your concerns about precedence in arithmetic expressions.

For all three terminals (UPPERCASE in the grammar above) my lexer supplies a type (string, number, datetime or boolean). Non-terminals return the type they produce. With this I managed quite nicely do type checking on the fly in my current implementation, including decent error messages. Hopefully this will also work for the new grammar.


Edit 2: Return to original OData grammar

The differentiation between a "logical" and "arithmetic" ( is not a trivial one. To solve the problem even N.Wirth uses a dodgy workaround to keep the grammar of Pascal simple. As a consequence, in Pascal an extra pair of () is mandatory around and and or expressions. Neither intuitive nor OData conformant :-(. The best read about the "() difficulty" I found is in Let's Build a Compiler (Part VI). Other languages seem to go to great length in the grammar to solve the problem. As I don't have experience with grammar construction I stopped doing my own.

I ended up implementing the original OData grammar. Before I run the parser I go over all tokens backwards to figure out which ( belong to a logical/arithmetic expression. Not a problem for the potential length of a URL.


Solution

  • Personally, I'd just modify the grammar so that it has only one type of expression and therefore one type of parenthesis. I'm not convinced that the OData grammar is actually correct; it is certainly not usable in an LL(1) (or recursive descent) parser for exactly the reason you mention.

    Specifically, if the goal is boolCommonExpr, there are two productions which can match the ( lookahead token:

    boolCommonExpr = ( … 
                     / commonExpr [ eqExpr / neExpr / … ]
                     / boolParenExpr
                     / …
                     ) …
    commonExpr     = ( …
                     / parenExpr
                     / …
                     ) …
    

    For the most part, this is a misguided attempt to make the grammar detect a type violation. (If in fact it is a type violation.) It's misguided because it is doomed to failure if there are boolean variables, which there apparently are in this environment. Since there is not syntactic clue as to the type of a variable, the parser is not capable of deciding whether particular expressions are well-formed or not, so there is a good argument for not trying at all, particularly if it creates parsing headaches. A better solution is to first parse the expression into an AST of some form, and then do another pass over the AST to check that each operators has operands of the correct type (and possibly inserting explicit cast operators if that is necessary).

    Aside from any other advantage, doing the type check in a separate pass lets you produce much better error messages. If you make (some) type violations syntax errors, then you may leave the user puzzled about why their expression was rejected; in contrast, if you notice that a comparison operation is being used as an operand to multiply (and if your language's semantics don't allow an automatic conversion from True/False to 1/0), then you can produce a well-targetted error message ("comparisons cannot be used as the operand of an arithmetic operator", for example).

    One possible reason to put different operators (but not parentheses) into different grammatical variables is to express grammatical precedence. That consideration might encourage you to rewrite the grammar with explicit precedence. (As written, the grammar assumes that all arithmetic operators have the same precedence, which would presumably lead to 2 + 3 * a being parsed as (2 + 3) * a, which might be a huge surprise.) Alternatively, you might use some simple precedence aware subparser for expressions.