javaccambiguity

Finding source of choice conflict in JavaCC grammar


I have a JavaCC grammar with a troublesome section that can be reduced to this:

void Start(): {}
{
  A()
  <EOF>
}

void A(): {}
{
  ( "(" A() ")" | "biz" )
  ( B() | C() )*
}

void B(): {}
{ "foo" A() }

void C(): {}
{ "bar" A() }

When I compile the above grammar, JavaCC warns of a choice conflict on the line ( B() | C() )*. There are 2 things I'm trying to understand. The first is why it thinks there is a conflict in this case. AFAICT at each point it should be able to determine which path to take based on just the current token. The second one is how to get rid of the warning. I can't seem to find the right spot to put the LOOKAHEAD statement. No matter where I put it, I either get a warning that it isn't at a choice point or I continue to get the same warning. Here's what I thought it might like:

void Start(): {}
{
  A()
  <EOF>
}

void A(): {}
{
  ( "(" A() ")" | "biz" )
  ( LOOKAHEAD(1) B() | LOOKAHEAD(1) C() )*
}

void B(): {}
{ "foo" A() }

void C(): {}
{ "bar" A() }

But this still produces the warning. I've also tried various semantic lookahead statements with no luck. I'm clearly missing something, but I at a loss as to what. FWIW, putting any token after ( B() | C() )* also "fixes" the issue, so I'm guessing it has something to do with it not knowing how to exit that loop, but seem like that should just be when it doesn't see "foo" or "bar". The generated code appears to be correct, but if there is an ambiguity here I'm not seeing then obviously that won't matter.

EDIT 1..

After some poking about and looking at a Java grammar I found that this makes things happy:

void Start(): {}
{
  A()
  <EOF>
}

void A(): {}
{
  ( "(" A() ")" | "biz" )
  ( 
    LOOKAHEAD(2)
    (B() | C()) 
  )*
}

void B(): {}
{ "foo" A() }

void C(): {}
{ "bar" A() }

I'm still not entirely clear why it would need the extra token to decide which option to take in the loop (and maybe it really doesn't).

EDIT 2...

OK, I see the issue now, the ambiguity is not between B and C, but between whether to do a depth first or breadth first construction of the tree. So the following is just as ambiguous:

void Start(): {}
{
  A()
  <EOF>
}

void A(): {}
{
  ( "(" A() ")" | "biz" )
  (B())*
}

void B(): {}
{ "foo" A() }

Switching from * to ? resolves the ambiguity as suggested.

If we change B to `{ "foo" A() "end" } that also resolves the issue since there is a clear end to B. Now suppose we do this:

void Start(): {}
{
  A()
  <EOF>
}

void A(): {}
{
  ( "(" A() ")" | "biz" )
  ( B() | C() )*
}

void B(): {}
{ "foo" A() "end" }

void C(): {}
{ "bar" A() }

Here I would expect that the same issue would exist for C, but JavaCC still reports that the ambiguous prefix is "foo". Is that just a reporting error? Of course using ? in this case is not possible since then you can't match successive B's (which we want). FWIW, the code generated here produces the depth first tree and since that's what I want it may be sufficient to suppress the warning.


Solution

  • Your grammar is ambiguous. Consider the input

    biz foo biz foo biz EOF
    

    We have the following two leftmost derivations.

    The first is

        Start
     => ^^^^^ Expand
         A EOF
     =>  ^ Expand
         ( "(" A ")" | "biz") ( B | C)* EOF
     =>                ^^^^^ choose
        "biz" ( B | C )* EOF
     =>       ^^^^^^^^^^ unroll and choose the B
        "biz" B ( B | C )* EOF
     =>       ^ expand
        "biz" "foo" A ( B | C )* EOF
     =>             ^ expand
        "biz" "foo" ( "(" A ")" | "biz") ( B | C)* ( B | C )* EOF
     =>                           ^^^^^ choose
        "biz" "foo" "biz" ( B | C)* ( B | C )* EOF
     =>                   ^^^^^^^^^^ unroll and choose the B
        "biz" "foo" "biz" B ( B | C)* ( B | C )* EOF
     =>                   ^ expand
        "biz" "foo" "biz" "foo" A ( B | C)* ( B | C )* EOF
     =>                         ^ expand
        "biz" "foo" "biz" "foo" ( "(" A ")" | "biz") ( B | C)* ( B | C)* ( B | C )* EOF
     =>                                       ^^^^^ choose
        "biz" "foo" "biz" "foo" "biz" ( B | C)* ( B | C)* ( B | C )* EOF
     =>                               ^^^^^^^^^ Terminate
        "biz" "foo" "biz" "foo" "biz" ( B | C)* ( B | C )* EOF
     =>                               ^^^^^^^^^ Terminate
        "biz" "foo" "biz" "foo" "biz"( B | C )* EOF
     =>                               ^^^^^^^^^ Terminate
        "biz" "foo" "biz" "foo" "biz" EOF
    

    For the second derivation everything is the same as the first until the point where the parser has to decide whether to enter the second loop.

        Start
    =>*  As above
        "biz" "foo" "biz" ( B | C)* ( B | C )* EOF
    =>                    ^^^^^^^^^ Terminate!! (Previously it was expand)
        "biz" "foo" "biz" ( B | C )*
    =>                    ^^^^^^^^^^  Unroll and choose B
        "biz" "foo" "biz" B ( B | C )* EOF
    =>                    ^ Expand
        "biz" "foo" "biz" "foo" A ( B | C )*  EOF
    =>                          ^ Expand
        "biz" "foo" "biz" "foo" ( "(" A ")" | "biz") ( B | C)* ( B | C )*  EOF
    =>                                        ^^^^^ Choose
        "biz" "foo" "biz" "foo" "biz" ( B | C)* ( B | C )*  EOF
    =>                                ^^^^^^^^^ Terminate
        "biz" "foo" "biz" "foo" "biz"( B | C )* EOF
    =>                                ^^^^^^^^^ Terminate
        "biz" "foo" "biz" "foo" "biz" EOF
       
    

    In terms of parse trees, there are the following two possibilities. The parse tree on the left is from the first derivation. The parse tree on the right is from the second

    Start  --> A -+---------------------------> biz <--------------+- A <-- Start
                  |                                                |
                  +-> B -+--------------------> foo <------+-- B <-+
                         |                                 |       |
                         +-> A -+-------------> biz <- A <-+       |
                                |                                  |
                                +-> B -+------> foo <------+-- B <-+
                                       |                   |
                                       +-> A -> biz <- A <-+
    

    When you have a choice

    LOOKAHEAD(3) X() | Y()
    

    That means look ahead at most 3 tokens to try to determine if X() is wrong. If the next 3 tokens show that X() is wrong, the parser uses Y(), otherwise it goes with X(). When your grammar is ambiguous, no amount of looking ahead will resolve the conflict. For ambiguous input, looking ahead won't help the parser to figure out which choice is "right" and which is "wrong" because are "right".

    So why does JavaCC stop warning when you put in the "LOOKAHEAD" directive. It's not because the look-ahead issue has been solved. It's because when you put in a look-ahead directive, JavaCC always stops giving warnings. The assumption is that you know what you are doing, even if you don't.

    Often the best way to deal with look-ahead problems is to rewrite the grammar to be unambiguous and LL(1).

    So what should you do? I'm not sure, because I don't know what kind of parse tree you prefer. If it's the one on the left, I think changing the * to a ? will fix the issue.

    If you like the parse tree on the right, I think the following grammar will do it

    void Start(): {}
    {
      A()
      <EOF>
    }
    
    void A(): {}
    {
      SimpleA()
      ( B() | C() )*
    }
    
    void SimpleA() : {}
    {
       "(" A() ")" | "biz" 
    }
    
    void B(): {}
    { "foo" SimpleA() }
    
    void C(): {}
    { "bar" SimpleA() }