I've recently wrapped my head around LALR enough to write an LALR generator, and I'm trying to construct a java- or c#-style grammar for it (the beginnings of which are specified here).
I know it's extra effort to write the parser generator, like re-inventing the wheel (why not use Antlr?), but my goal is to bootstrap a hobby OS that can compile itself without depending on a third-party toolchain. My problem though is not with the generator, but with the grammar.
I'm running into reduce/reduce ambiguities with statements and expressions.
I know how to resolve certain types of ambiguities, such as dangling-else, but these few have are not intuitive to me, and they have me stumped.
What's the best way to resolve these? Also, is there a prototyping tool that I can use to help visualize the solution? Or, should I go back to square one and try implementing a GLR parser generator for the grammar?
These statements are legal:
Generic.List<int> myVar1 = x + 4, myVar2; // stmt -> var-decl ;
// var-decl -> type-name var-decl-list
t = 99; // simple-stmt -> assign
i++; // simple-stmt -> incr-decr
// incr-decr -> primary-expr ++
json.deserialize<list<int>>(obj); // simple-stmt -> call
// call -> primary-expr ( params )
// ... -> primary-expr . basic-name ( params )
// ... -> basic-name . basic-name ( params )
Here's how it's set up:
basic-name : ident < type-list >
| ident
nested-name : nested-name . basic-name
| basic-name
basic-type : int | bool | ...
type-name : nested-name
| basic-type
stmt : var-decl ;
| simple-stmt ;
| ...
var-decl : type-name var-decl-list
var-decl-list : var-decl-list , var-init
| var-init
var-init : ident assign-op expression
| ident
simple-stmt : assign
| call
| incr-decr
expr : assign-expr
assign-expr : assign
| cond-expr
assign : unary-expr assign-op expr
...
rel-expr : rel-expr < shift-expr
...
| shift-expr
...
unary-expr : unary-op primary-expr
| primary-expr
unary-op : + - ! ~ ++ -- // Prefix operators
| ( type-name ) // Conversion operator
primary-expr : call
| primary-expr . basic-name
| primary-expr ++
| ( expr )
...
| basic-name
call : primary-expr ( params )
incr-decr : primary-expr ++
| -- primary-expr
| ...
So when the parser is expecting a statement, the *LR(k) item set kernel is method-body -> { * stmts-opt }
and the full item set for the state looks something like this:
method-body -> { * stmts-opt }
stmts-opt -> * stmts
stmts-opt -> *
stmts -> * stmts stmt
stmt -> * var-decl ;
stmt -> * simple-stmt ;
var-decl -> * type-name var-decl-list
simple-stmt -> * assign
simple-stmt -> * call
simple-stmt -> * incr-decr
type-name -> * nested-name
type-name -> * basic-type
nested-name -> * nested-name . basic-name
nested-name -> * basic-name
basic-name -> * ident < type-list >
basic-name -> * ident
assign -> * unary-expr assign-op expr
unary-expr -> * unary-op primary-expr
unary-expr -> * primary-expr
unary-op -> * ( typename )
unary-op -> * ! | ~ | ...
primary-expr -> * call
primary-expr -> * primary-expr . basic-name
primary-expr -> * primary-expr ++
primary-expr -> * basic-name
primary-expr -> * ( expr )
call -> * primary-expr ( params )
incr-decr -> * primary-expr ++
...
When an identifier is shifted, the next state is:
basic-name -> ident *
basic-name -> ident * < type-list >
which is parsed out or reduced, and brings the next state to:
nested-name -> basic-name *
primary-expr -> basic-name *
Potential conflict. In the parent state, lookaheads don't help, since there is a dot in nested-name
and primary-expr
. Oh, goody, let's try reducing by nested-name:
type-name -> nested-name *
nested-name -> nested-name * . basic-name
Nothing to see here...
Now, how about reducing by primary-expr
instead:
unary-expr -> primary-expr *
primary-expr -> primary-expr * . basic-name
primary-expr -> primary-expr * ++
call -> primary-expr * ( params )
incr-decr -> primary-expr * ++
...
Now when we shift ++, we get:
primary-expr -> primary-expr ++ *
incr-decr -> primary-expr ++ *
...another reduce-reduce conflict.
Let's try shifting the (
instead of the ident
:
primary-expr -> ( * expr )
unary-op -> ( * type-name )
expr -> * assign-expr
assign-expr -> * assign
assign-expr -> * cond-expr
assign -> * unary-expr assign-op expr
unary-expr -> * unary-op primary-expr
unary-expr -> * primary-expr
unary-op -> * ( typename )
unary-op -> * ! | ~ | ...
primary-expr -> * call
primary-expr -> * primary-expr . basic-name
primary-expr -> * primary-expr ++
primary-expr -> * basic-name
primary-expr -> * ( expr )
call -> * primary-expr ( params )
cond-expr -> * ...
...
rel-expr -> * rel-expr < shift-expr
rel-expr -> * shift-expr
...
type-name -> * nested-name
type-name -> * basic-type
nested-name -> * nested-name . basic-name
nested-name -> * basic-name
basic-name -> * ident < type-list >
basic-name -> * ident
Same issues when you shift an ident
or (
onto the stack.
These are just the ones I've run into so far. Since basic-name
has precedence over rel-expr
, I'm assuming something like x < n
would be interpreted as basic-name -> ident < type-list *
, and error out if it was actually a relational expression.
My brain has reached the point where I could really use some help.
There are a few questions in your post, which makes it not really ideal for SO. But I'll try to provide some thoughts about each one. As I see it, you have three issues:
Distinguishing expression statements from expressions which are not statements.
Parsing hierarchically-named types in a declaration without conflicting with field-access expressions in expression statements
Distinguishing between the use of < as a comparison operator and as a template bracket.
As I understand, the desire is to only allow as statements expressions which have (or potentially have) some kind of side-effect: assignments, increment mutations, and subroutine calls. Roughly speaking, this corresponds to Java, whose grammar includes:
StatementExpression:
Assignment
PreIncrementExpression
PreDecrementExpression
PostIncrementExpression
PostDecrementExpression
MethodInvocation
ClassInstanceCreationExpression
Each of the alternatives listed for StatementExpression
also appears somewhere in the derivation tree for Expression
, where they have been factored out of the lists of possibilities. Here's a very condensed sample:
Expression:
LambdaExpression
AssignmentExpression
AssignmentExpression:
ConditionalExpression
Assignment
Assignment:
LeftHandSide AssignmentOperator Expression
...
UnaryExpression:
PreIncrementExpression
+ UnaryExpression
UnaryExpressionNotPlusMinus
PreIncrementExpression:
++ UnaryExpression
UnaryExpressionNotPlusMinus:
PostfixExpression
~ UnaryExpression
PostfixExpression:
Primary
ExpressionName
PostIncrementExpression
PostIncrementExpress:
PostfixExpression ++
Note how the non-terminals used on the right-hand side of ExpressionStatement
are special-cased at each precedence level. In the C++ grammar, which doesn't limit which expressions can be statements, there is no need for a separate Assignment
non-terminal:
assignment-expression:
conditional-expression
logical-or-expression assignment-operator initializer-clause
But in Java, that won't work. It needs to create the additional non-terminal which does not derive ConditionalExpression
, precisely in order to allow Assignment
to be a Statement
and AssignmentExpression
to be an Expression
.
Similar to the above, here it is necessary to put hierarchical names (which must start with a basic-name
) from other forms of field access expressions, which might start with any (other) primary-expr
. The former can be type names or primary expressions; the latter can only be type names. To make this distinction, we need to split the primary-expr
production:
primary-expr : field-access-expr
| nested-name
non-field-access-expr:
call
| post-increment-expression // was primary-expr ++
| ( expr )
...
field-access-expr :
non-field-access-expr
| field-access-expr . basic-name
Unlike the other two issues, this one might actually be an ambiguity in the language. In C++, for example, template brackets are definitely ambiguous; they can only be resolved by knowing (or being told) whether a particular name is a template name or not. In Java, on the other hand, the ambiguity is solved by sometimes requiring the type parameters to come before the generic name. For example:
ConstructorDeclarator:
[TypeParameters] SimpleTypeName ( [FormalParameterList] )
or
MethodInvocation:
Primary . [TypeArguments] Identifier ( [ArgumentList] )
In C#, there is yet a different rule, which requires looking at the character following the > which potentially matches the opening <. The algorithm is described in §7.6.4.2 of the C# manual; I have no idea how you would implement that in an LALR(1) parser.