parsingantlrgrammarll-grammar

Is this rule LL(1)?


On a sort of academic note, I am wondering whether the following rule is LL(1) or 2:

tableItem:
    Identifier (Dot Identifier)? (Dot Identifier)? (Dot Identifier)?        # simplePath
    | OpenParen selectStatement CloseParen                                  # subSelect
    | OpenParen VALUES OpenParen expr (Comma expr)* CloseParen CloseParen   # valuesClause
    ;

The first character will determine whether it is an ( or not and the second character will determine whether it is a vV or not, and so in two characters I believe we resolve the rule. However, is that considered LL(1) or LL(2) since it takes two tokens? If it's LL(2), how would the above be re-written so that it is LL(1)?


Solution

  • In an LL(1) grammar, you have to be able to predict which production to apply from the first lookahead character. So if you have two productions for the same non-terminal which start with the same terminal, that grammar is not LL(1). It can be made LL(1) by left-factoring, since there is no left-recursion.

    If you want to talk about LL(k) in formal terms, you have to first reduce the grammar to BNF. Antlr-style EBNF is outside of the formal theory. (It is possible to extend formal parsing theory to what are generally called "regular right-hand sides", but you need slightly different parsing algorithms, and since LL and LR (for example) describe specific parsing algorithms, the definitions change as well.)

    Reducing that grammar to BNF will reveal some of its other issues; for example, the first production as written is ambiguous because it's impossible to know which of the (Dot Identifier)? has been omitted if they are not all present. (Antlr does this greedily, but greedy parsing and ordered alternatives also change the landscape.) Writing it out as BNF would trivially disambiguate, but of course the grammar would be longer. (Why is it limited to three (Dot Identifier)? though? Wouldn't (Dot Identifier)* make more sense?)

    Leaving that aside, the left-factoring goes like this. We start with:

    tableItem
        : OpenParen selectStatement CloseParen
        | OpenParen VALUES OpenParen expr (Comma expr)* CloseParen CloseParen
    
    1. Remove the Kleene star operator, right-recursive version
    tableItem
        : OpenParen selectStatement CloseParent
        | OpenParen VALUES OpenParen exprList CloseParen CloseParen
    exprList
        : expr
        | expr Comma exprList
    
    1. Left factor, to avoid two productions starting expr and two productions starting OpenParen:
    tableItem
        : OpenParen tableItemRest
    tableItemRest
        : selectStatement CloseParen
        | VALUES OpenParen exprList CloseParen CloseParen
    exprList
        : expr exprListRest
    exprListRest
        : ε
        | Comma exprList
    

    Assuming that expr and selectStatement are both LL(1), that's an LL(1) grammar.