pythonplylalrunrealscript

Resolving shift/reduce conflict for default label in switch block


I'm writing a parser for Unrealscript using PLY, and I've run into (hopefully) one of the last ambiguities in the parsing rules that I've set up.

Unrealscript has a keyword, default, which is used differently depending on the context. In a regular statement line, you could use default like so:

default.SomeValue = 3;  // sets the "default" class property to 3

There is also, of course, the default case label for switch statements:

switch (i) {
    case 0:
        break;
    default:
        break;
}

There is an ambiguity in the parsing when it encounters the default label within what it thinks is the case's statement block. Here is an example file that runs into a parsing error:

Input

class Example extends Object;

function Test() {
    switch (A) {
        case 0:
            default.SomeValue = 3;    // OK
        default:                      // ERROR: Expected "default.IDENTIFIER"
            break;
    }
}

Parsing Rules

Here are the rules that are in conflict:

All of the rules can be seen in their entirety on GitHub.

default

def p_default(p):
    'default : DEFAULT PERIOD identifier'
    p[0] = ('default', p[3])

switch

def p_switch_statement(p):
    'switch_statement : SWITCH LPAREN expression RPAREN LCURLY switch_cases RCURLY'
    p[0] = ('switch_statement', p[3], p[6])


def p_switch_case_1(p):
    'switch_case : CASE primary COLON statements_or_empty'
    p[0] = ('switch_case', p[2], p[4])


def p_switch_case_2(p):
    'switch_case : DEFAULT COLON statements_or_empty'
    p[0] = ('default_case', p[3])


def p_switch_cases_1(p):
    'switch_cases : switch_cases switch_case'
    p[0] = p[1] + [p[2]]


def p_switch_cases_2(p):
    'switch_cases : switch_case'
    p[0] = [p[1]]


def p_switch_cases_or_empty(p):
    '''switch_cases_or_empty : switch_cases
                             | empty'''
    p[0] = p[1]

Any help or guidance on how to resolve this conflict will be greatly appreciated! Thank you in advance.


Solution

  • What you have here is a simple shift/reduce conflict (with the token default as lookahead) being resolved as a shift.

    Let's reduce this all to a much smaller, if not minimal example. Here's the grammar, partly based on the one in the Github repository pointed to in the OP (but intended to be self-contained):

    statements: statements statement |
    statement : assign SEMICOLON
              | switch
    assign    : lvalue EQUALS expression
    switch    : SWITCH LPAREN expression RPAREN LCURLY cases RCURLY
    cases     : cases case | 
    case      : CASE expression COLON statements
              | DEFAULT COLON statements
    expression: ID | INT
    lvalue    : ID | DEFAULT
    

    The key here is that a statement might start with the token DEFAULT, and a case might also start with the token DEFAULT. Now, suppose we've reached the following point in the parse:

    switch ( <expression> ) { <cases> case <expression> : <statements>
    

    so we're in the middle of a switch compound statement; we've seen case 0: and we're working on a list of statements. The current state includes the items (there are a few more; I only include the relevant ones):

    1. statements: statements · statement
    2. case      : CASE expression COLON statements ·
    3. statement : · assign SEMICOLON
    4. assign    : · lvalue EQUALS expression
    5. lvalue    : · DEFAULT
    

    The lookahead set for item 2 is [ RCURLY, DEFAULT, ID ].

    Now suppose the next token is default. We could be looking at the start of a statement, if the default is followed by =. Or we could be looking at a new case clause, if the default is followed by :. But we can't see two tokens into the future, only one; the next token is default and that is all we know.

    But we need to make a decision:

    Like most LR parser generators, PLY resolves shift/reduce conflicts in favour of the shift, so it will always take the first of the two options above. If it then sees : instead of =, it will report a syntax error.

    So what we have is just another example of an LR(2) grammar. LR(2) grammars can always be rewritten as LR(1) grammars, but the rewritten grammar is often ugly and bloated. Here is one possible solution, which is possibly less ugly than most.

    The body of a switch, using EBNF operators |, * and + (alternation, optional repetition, and repetition) is:

    switch-body -> (("case" expression | "default") ":" statement*)+
    

    Or, to make it a little less cumbersome:

    case-header -> ("case" expression | "default") ":"
    switch-body -> (case-header statement*)+
    

    From the perspective of accepted strings, that's exactly the same as

    switch-body -> case-header (case-header | statement)*
    

    In other words, a sequence of things which are either case-headers or statements, where the first one is a case-header.

    This way of writing the rule does not generate the correct parse tree; it simplifies the structure of a switch statement into a soup of statements and case labels. But it does recognise exactly the same language.

    On the plus side, it has the virtue of not forcing the parser to decide when a case cause has terminated. (The grammar no longer has case clauses.) So it is a simple LR(1) grammar:

    switch       : SWITCH LPAREN expression RPAREN LCURLY switch_body RCURLY
    switch_body  : case_header
                 | switch_body statement
                 | switch_body case_header
    case_header  : CASE expr COLON
                 | DEFAULT COLON
    

    Now, we could make the argument that the resulting parse tree is, in fact, accurate. Unrealscript shares the same design decision about switch statements as C, in which a case clause does not actually define a block in any real sense of the word. It is simply a label which can be jumped to, and a conditional jump to the next label.

    But it is actually not particularly complicated to fix the parse tree as we go, because each reduction to switch_body clearly indicates what we're adding. If we're adding a case-header, we can append a new list to the accumulating list of case clauses; if it's a statement, we append the statement to the end of the last case clause.

    So we could write the above rules in PLY roughly as follows:

    def p_switch_body_1(p):
        ''' switch_body  : case_header '''
        p[0] = [p[1]]
    
    def p_switch_body_2(p):
        ''' switch_body  : switch_body statement '''
        # Append the statement to the list which is the last element of
        # the tuple which is the last element of the list which is the
        # semantic value of symbol 1, the switch_body.
        p[1][-1][-1].append(p[2])
        p[0] = p[1]
    
    def p_switch_body_3(p):
        ''' switch_body  : switch_body case_header '''
        # Add the new case header (symbol 2), whose statement list
        # is initially empty, to the list of switch clauses.
        p[1].append(p[2])
        p[0] = p[1]
    
    def p_case_header_1(p):
        ''' case_header  : CASE expr COLON '''
        p[0] = ('switch_case', p[2], [])
    
    def p_case_header_2(p):
        ''' case_header  : DEFAULT COLON '''
        p[0] = ('default_case', [])