I am trying to write a simple parser using lemon, for a javascript-like language. I am unable to resolve a conflict error, and I suspect it is a unsolvable problem.
The conflict is between the grammar for:
{x = 10;}
and
{x:10};
The first is a statement block containing an assignment statement and the second is an expression statement defining an object.
A grammar to parse both of them results in a conflict. The minimal code is as follows:
rMod ::= rStmt.
rStmt ::= rStmtList RCURLY. {leaveScope();}
rStmtList ::= rStmtList rStmt.
rStmtList ::= LCURLY. {enterScope();}
rStmt ::= rExpr SEMI.
rExpr ::= rObj.
rObj ::= LCURLY rObjItemList RCURLY.
rObjItemList ::= rObjItemList COMMA rObjItem.
rObjItemList ::= rObjItem.
rObjItem ::= ID COLON rExpr.
rExpr ::= ID.
rExpr ::= NUM.
The out file shows the following:
State 4:
(3) rStmtList ::= LCURLY *
rObj ::= LCURLY * rObjItemList RCURLY
rObjItemList ::= * rObjItemList COMMA rObjItem
rObjItemList ::= * rObjItem
rObjItem ::= * ID COLON rExpr
ID shift 8
ID reduce 3 ** Parsing conflict **
rObjItemList shift 6
rObjItem shift-reduce 8 rObjItemList ::= rObjItem
{default} reduce 3 rStmtList ::= LCURLY
Any suggestions on how I can resolve this would be gratefully accepted. Thanks.
The heart of the problem is that you want to execute enterScope()
after the brace which initiates a statement block. However, if the brace is followed by the two tokens VAR
and :
, then it starts an object literal, not a block. So it is impossible to know whether or not to execute the enterScope
action without two-token lookahead, and lemon does not produce LR(2) grammars. To that extent, you are correct that the problem is unsolvable. But of course there are solutions.
Probably the worst solution from any perspective (readability, complexity, verificability) is to create an LR(1) grammar using the usual LR(2)→LR(1) transformation, which will allow you to call the enterScope();
action at the point where it is clear that a scope has been entered. This means delaying the reduction by one token. That in turn means dividing expr
into two disjoint non-terminals: those expr
which can start with a VAR
and those which cannot. For those expr
which can start with a VAR
, you also need to provide a mechanism which essentially allows you to glue together a VAR
and the rest of the expr
; in the case of expressions, that is particularly ugly (but still possible). The goal is to be able to write:
block(A) ::= blockPrefix(B) RCURLY . { closeScope(); A = B;}
blockPrefix(A) ::= lcurlyOpen exprNotStartingVAR(E) . { A = E; }
blockPrefix(A) ::= lcurlyVAR(V) restOfExprStartingVar(R) . { A = makeExpr(V, R); }
blockPrefix(A) ::= blockPrefix(B) SEMI expr(E) . { A = appendExpr(B, E); }
lcurlyOpen ::= LCURLY . { openScope(); }
lcurlyVAR(A) ::= LCURLY VAR(V) . { openScope(); A = V; }
An alternative, which is also ugly but probably less ugly in this particular case, is to recognize a variable name followed by a colon as a single lexical token (VAR_COLON
). Although that complicates the lexer (particularly since you need to recognize constructs where whitespace or even comments appear between the variable name and the colon), it makes the grammar much simpler. With that change, there is no conflict because the object literal must start with a VAR_COLON
while an expr can only start with a VAR
(or other unrelated tokens).
A much simpler solution is to not try to create the scope inherited attribute. If we do scope resolution synthetically, then the problem more or less vanishes:
stmt ::= expr SEMI | block .
stmtList ::= stmt .
stmtList ::= stmtList stmt .
block(A) ::= LCURLY stmtList(B) RCURLY . { A = applyScope(newScope(), B); }
objLiteral ::= LCURLY itemList RCURLY .
objLiteral ::= LCURLY RCURLY .
itemList ::= item .
itemList ::= itemList COMMA item .
item ::= VAR COLON expr .
expr ::= VAR .
expr ::= objLiteral .
...
That grammar has no conflicts, but it might radically change the way you handle scopes, since it requires variable names to be scoped once a block is complete rather than doing it in-line as the parse proceeds.
However, I would argue that for most languages (including Javascript), it is actually more convenient to do scoping at the end of a block, or even as a post-parse walk over the AST. Javascript, unlike C, allows local variables to be declared after their first mention. Local functions can even be used before their declaration. (This is subtly different from Python, where a function declaration is an executable assignment, but the scoping rules are similar.)
As another example, C++ allows class members to be declared anywhere inside the declaration of the class, even if the member has already been mentioned inside another class member function.
And there are many other examples. These scoping rules generally benefit the programmer by allowing stylistic options (such as putting member variable definitions at the end of a class definition in C++) which would not be possible in C.