I'm writing a parser for Unrealscript using PLY, and I've run into (hopefully) one of the last ambiguities in the parsing rules that I've set up.
Unrealscript has a keyword, default
, which is used differently depending on the context. In a regular statement line, you could use default
like so:
default.SomeValue = 3; // sets the "default" class property to 3
There is also, of course, the default
case label for switch
statements:
switch (i) {
case 0:
break;
default:
break;
}
There is an ambiguity in the parsing when it encounters the default
label within what it thinks is the case
's statement block. Here is an example file that runs into a parsing error:
class Example extends Object;
function Test() {
switch (A) {
case 0:
default.SomeValue = 3; // OK
default: // ERROR: Expected "default.IDENTIFIER"
break;
}
}
Here are the rules that are in conflict:
All of the rules can be seen in their entirety on GitHub.
default
def p_default(p):
'default : DEFAULT PERIOD identifier'
p[0] = ('default', p[3])
switch
def p_switch_statement(p):
'switch_statement : SWITCH LPAREN expression RPAREN LCURLY switch_cases RCURLY'
p[0] = ('switch_statement', p[3], p[6])
def p_switch_case_1(p):
'switch_case : CASE primary COLON statements_or_empty'
p[0] = ('switch_case', p[2], p[4])
def p_switch_case_2(p):
'switch_case : DEFAULT COLON statements_or_empty'
p[0] = ('default_case', p[3])
def p_switch_cases_1(p):
'switch_cases : switch_cases switch_case'
p[0] = p[1] + [p[2]]
def p_switch_cases_2(p):
'switch_cases : switch_case'
p[0] = [p[1]]
def p_switch_cases_or_empty(p):
'''switch_cases_or_empty : switch_cases
| empty'''
p[0] = p[1]
Any help or guidance on how to resolve this conflict will be greatly appreciated! Thank you in advance.
What you have here is a simple shift/reduce conflict (with the token default
as lookahead) being resolved as a shift.
Let's reduce this all to a much smaller, if not minimal example. Here's the grammar, partly based on the one in the Github repository pointed to in the OP (but intended to be self-contained):
statements: statements statement |
statement : assign SEMICOLON
| switch
assign : lvalue EQUALS expression
switch : SWITCH LPAREN expression RPAREN LCURLY cases RCURLY
cases : cases case |
case : CASE expression COLON statements
| DEFAULT COLON statements
expression: ID | INT
lvalue : ID | DEFAULT
The key here is that a statement
might start with the token DEFAULT
, and a case
might also start with the token DEFAULT
. Now, suppose we've reached the following point in the parse:
switch ( <expression> ) { <cases> case <expression> : <statements>
so we're in the middle of a switch
compound statement; we've seen case 0:
and we're working on a list of statements. The current state includes the items (there are a few more; I only include the relevant ones):
1. statements: statements · statement
2. case : CASE expression COLON statements ·
3. statement : · assign SEMICOLON
4. assign : · lvalue EQUALS expression
5. lvalue : · DEFAULT
The lookahead set for item 2 is [ RCURLY, DEFAULT, ID ]
.
Now suppose the next token is default. We could be looking at the start of a statement, if the default is followed by =. Or we could be looking at a new case clause, if the default is followed by :. But we can't see two tokens into the future, only one; the next token is default and that is all we know.
But we need to make a decision:
If the default is the beginning of a statement, we can just shift it (item 5). Then when we see the =, we'll reduce default to lvalue
and continue to parse the assign
.
If the default is the beginning of a case, we need to reduce CASE expression COLON statements
to case
(item 2). We will then reduce cases case
to cases
before we finally shift the default. We will then shift the : and continue with DEFAULT COLON statements
.
Like most LR parser generators, PLY resolves shift/reduce conflicts in favour of the shift, so it will always take the first of the two options above. If it then sees : instead of =, it will report a syntax error.
So what we have is just another example of an LR(2) grammar. LR(2) grammars can always be rewritten as LR(1) grammars, but the rewritten grammar is often ugly and bloated. Here is one possible solution, which is possibly less ugly than most.
The body of a switch, using EBNF operators |
, *
and +
(alternation, optional repetition, and repetition) is:
switch-body -> (("case" expression | "default") ":" statement*)+
Or, to make it a little less cumbersome:
case-header -> ("case" expression | "default") ":"
switch-body -> (case-header statement*)+
From the perspective of accepted strings, that's exactly the same as
switch-body -> case-header (case-header | statement)*
In other words, a sequence of things which are either case-header
s or statement
s, where the first one is a case-header
.
This way of writing the rule does not generate the correct parse tree; it simplifies the structure of a switch statement into a soup of statements and case labels. But it does recognise exactly the same language.
On the plus side, it has the virtue of not forcing the parser to decide when a case cause has terminated. (The grammar no longer has case clauses.) So it is a simple LR(1) grammar:
switch : SWITCH LPAREN expression RPAREN LCURLY switch_body RCURLY
switch_body : case_header
| switch_body statement
| switch_body case_header
case_header : CASE expr COLON
| DEFAULT COLON
Now, we could make the argument that the resulting parse tree is, in fact, accurate. Unrealscript shares the same design decision about switch
statements as C, in which a case
clause does not actually define a block in any real sense of the word. It is simply a label which can be jumped to, and a conditional jump to the next label.
But it is actually not particularly complicated to fix the parse tree as we go, because each reduction to switch_body
clearly indicates what we're adding. If we're adding a case-header, we can append a new list to the accumulating list of case clauses; if it's a statement, we append the statement to the end of the last case clause.
So we could write the above rules in PLY roughly as follows:
def p_switch_body_1(p):
''' switch_body : case_header '''
p[0] = [p[1]]
def p_switch_body_2(p):
''' switch_body : switch_body statement '''
# Append the statement to the list which is the last element of
# the tuple which is the last element of the list which is the
# semantic value of symbol 1, the switch_body.
p[1][-1][-1].append(p[2])
p[0] = p[1]
def p_switch_body_3(p):
''' switch_body : switch_body case_header '''
# Add the new case header (symbol 2), whose statement list
# is initially empty, to the list of switch clauses.
p[1].append(p[2])
p[0] = p[1]
def p_case_header_1(p):
''' case_header : CASE expr COLON '''
p[0] = ('switch_case', p[2], [])
def p_case_header_2(p):
''' case_header : DEFAULT COLON '''
p[0] = ('default_case', [])