What is non LL(*) about this right-recursive Antlr rule?

Consider the following Antlr3 grammar:

grammar stat;

s : 
  label ID '=' ID
| label 'return' ID
;

label
  : ID ':' label
|
;

ID: 'a'..'z' + ;

Given that Antlr3 is LL(*), which means the parser can look arbitrarily far ahead into the token stream when selecting an alternative, why am I getting the following error?

 [fatal] rule s has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2.  Resolve by left-factoring or using syntactic predicates or using backtrack=true option.

Question:

How can I fix the above grammar? (Explanation behind the fix would be greatly appreciated.)
What is * in LL(*) if not backtracking after an unsuccessful, arbitrarily long lookahead, versus the static k of LL(k)?
Isn't automatic left-factoring something automatically done by Antlr3?
Curious, how could this also be fixed via use of syntactic predicates?

I'm using Antlr 3.5.2 .

Solution

This exact example (character-by-character) is found on page 277 of the ANTLR3 version of The Definitive ANTLR Reference, which is available on-line (at least for limited previewing as well as paid download). It's used as an explanation of one of the aspects of the LL(*) algorithm used in ANTLR3, so the explanation on the following pages is probably the best detailed answer to your question. In addition, various possible solutions are discussed.

So let me focus on just one of your questions, which I think will also answer the question in the post title.

What is * in LL(*) if not backtracking after an unsuccessful, arbitrarily long lookahead?

LL(*) is a heuristic algorithm and not a formal language category; Terence Parr (the author of ANTLR) uses the term to label a particular parsing algorithm used in ANTLR3. (ANTLR4 uses a different algorithm, which Parr calls Adaptive LL(*). I'm not goint to get into that.)

LL parsing, in general terms, decides which production to use before starting to parse the production. LL(k) parsing makes that decision based on the following k tokens, where k is some fixed (and usually small) integer. That was basically the algorithm used in ANTLR2, and it's not really good enough to handle many grammars.

To make the algorithm more general, you can add backtracking (which ANTLR does); if after examining k tokens, the parser still can't decide which production to use, it tries them in order until it finds one which works. If an attempted production fails, the parser returns to the decision point and tries the next production. If a production succeeds, the parser accepts it and continues, even if that might lead to a parsing failure later on. (So the order of productions in the grammar is important in a backtracking parser.)

LL(*) does not involve backtracking. Rather, it constructs a finite-automaton-based forward scan which matches an arbitrarily long lookahead sequence against a regular expression (of tokens, not characters, but the principle is the same). If it can construct (while compiling the grammar) regular expressions for each candidate production which unambiguously distinguish between the alternatives, then it can rapidly run the finite automatons in parallel at a decision point and make a decision. The * in LL(*) makes reference to the fact that there is no arbitrary length limit on the lookahead; it is only necessary that the regular expressions diverge at some point in the future.

As a practical example, in the grammar you present the alternative

  label ID '=' ID

should be chosen if the lookahead matches (ID ':')* ID '=', while

| label 'return' ID

should be chosen if the lookahead matches (ID ':')* ID 'return'. These are evidently distinct regular expressions; at most one can match any lookahead sequence.

However, all of that is dependent on the parser generator being able to work out the lookahead patterns. And that's a difficult (possibly impossible) problem in the general case. ANTLR3 does its best, but it cannot cope with recursive non-terminals (like label) in the lookahead sequence.

Since it can cope with repetition operators, you can fix the issue by simply changing the definition of label to a non-recursive alternative. (This comes straight out of the book referenced earlier.)

    label: (ID ':')*

That definition is already in regular expression form, and ANTLR3 can just plug it in to the production to produce effective lookahead patterns.

That's all probably just fine in practice, since long sequences of labels are not likely to occur in practice. But personally, I'd go for a simple LL(2) solution which doesn't use a label non-terminal at all:

s : 
  ID '=' ID
| 'return' ID
| ID ':' s
;

This solution doesn't require an unlimited lookahead search. And it still lets you insert arbitrary actions for each alternative.