I get 2 shift/reduce errors when I try to compile the following grammar which is part of a larger one:
%token NAMESPACE IDENTIFIER
%start statement
%%
post_expression
: IDENTIFIER
| post_expression '[' expression ']'
| post_expression '.' IDENTIFIER '(' ')'
| post_expression '.' IDENTIFIER
;
expression
: post_expression
| expression '<' post_expression
| expression '>' post_expression
;
data_type
: IDENTIFIER
| IDENTIFIER '<' data_type_list '>'
| IDENTIFIER '<' '>'
| IDENTIFIER '[' ']'
;
statement
: expression ';'
| data_type initializer ';'
;
initializer
: IDENTIFIER
| IDENTIFIER '=' expression
;
data_type_list
: data_type
| data_type_list ',' data_type
;
%%
the conflict state is as follows:
State 1
1 post_expression: IDENTIFIER .
8 data_type: IDENTIFIER .
9 | IDENTIFIER . '<' data_type_list '>'
10 | IDENTIFIER . '<' '>'
11 | IDENTIFIER . '[' ']'
'[' shift, and go to state 6
'<' shift, and go to state 7
IDENTIFIER reduce using rule 8 (data_type)
'[' [reduce using rule 1 (post_expression)]
'<' [reduce using rule 1 (post_expression)]
$default reduce using rule 1 (post_expression)
Could someone please explain how to fix this error? Is it possible to use precedence to solve the problem?
These two conflicts are the result of the parser being limited to a single lookahead token.
Consider the two rules:
post_expression : post_expression '[' expression ']'
data_type : IDENTIFIER '[' ']'
and think about what happens when the parser sees an IDENTIFIER
followed by a [
. Since IDENTIFIER
would match post_expression
, both of these rules are active. But the parser cannot proceed without deciding whether or not to reduce the IDENTIFIER
to a post_expression
, because in a shift-reduce parser, reduction actions must be taken at the end of the production, and no later. If the parser could look two tokens into the remaining input, there wouldn't be a problem; either the second-next token is a ]
or it isn't (although in the real grammar, the decision might be more complicated).
A very similar issue occurs with:
expression
: expression '<' post_expression
data_type
: IDENTIFIER '<' data_type_list '>'
: IDENTIFIER '<' '>'
Again, the decision about whether to reduce IDENTIFIER
to expression
needs to be taken before the parser has enough information to make the decision.
You can't fix these problems with precedence rules, because precedence rules don't extend the parser's lookahead. With or without precedence rules, the parser cannot make the correct decision without seeing (at least) the second-next token.
This is not a conflict you can ignore. Bison will resolve the conflict one way or another (by default, it will choose the shift action, meaning that it will never do the reduction), but that resolution will not be correct for many inputs, and these will generate syntax errors because the parser has been forced into a dead-end.
So, what to do?
First possibility: eliminate the need to make a choice. You could accept a much more general syntax (such as post_expression '[' ']'
for data_type
) and then use a semantic rule to reject parses in which the post_expression
was not an IDENTIFIER
. Or you could arrange for post_expression
to not match IDENTIFIER
, by dividing expression
into two categories: IDENTIFIER
and complex_expression
.
Both of these alternatives are ugly; the first one requires validating the AST of the post_expression
, which ties the parser more tightly to the details of the AST than you might want, and the second one requires extensive modification to the grammar, reducing grammar readability and making grammar maintenance more difficult. But both of these do have one advantage: you can produce much better error messages.
Second possibility: Maybe it's not the case that data_type
can start with any IDENTIFIER
, and maybe the IDENTIFIER
s in data_type
cannot really be expression
s. That's certainly the case in some popular programming languages, where types and expressions are in the same namespace but don't respond to the same syntax. If that's the case, you might be able to tweak your lexical scanner to produce different tokens for value identifiers and type identifiers. (That's a very common strategy in C parsers, for example.) But that's also a bit ugly, because it probably means that the lexer needs to have access to the symbol table, and maybe even to name resolution functions, which might not even be needed by the parser. So that will require a much tighter coupling between name resolution, parsing and lexical scanning than good program design might suggest.
Nonetheless, if it is practical, I'd basically go with it; "good design" is not meant to exclude reasonable solutions. Just take care to keep the interface between the various components as simple as possible, without unnecessarily adding new backdoors into opaque data structures.
With Bison, you have a third alternative: you can ask Bison to create a GLR parser, which is not limited by lookahead. The GLR parser can handle arbitrarily long lookahead by simply maintaining all possibilities in parallel until enough information is received to resolve it.
Bison's GLR algorithm doesn't like real ambiguity, but that's not usually a problem because programming languages shouldn't be ambiguous. Still, it's sometimes necessary to allow the ambiguous parse because it cannot really be resolved without semantic information from a subsequent pass. In that case, you can use a custom merge procedure to create an AST with alternative branches.
Even if you ask Bison to generate a GLR-parser, Bison will produce the conflict warnings, on the basis that you probably want to know about them. However, if your grammar is not ambiguous, you don't need to worry about them. If your grammar is ambiguous and you try to parse an input with multiple possible parses, the parser will recognise the problem at runtime, and issue an error message. Good testing will help identify the possible ambiguities, but it doesn't provide a proof of unambiguity.