algorithmparsinglr-grammar

What is an LR(2) parser? How does it differ from an LR(1) parser?


I'm familiar with LR(1) parsers, which are usually taught in traditional compilers courses. I know that LR(2) parsers exist, but I've never seen one constructed before.

How is an LR(2) parser constructed? How does it differ from an LR(1) parser?


Solution

  • In many ways, an LR(2) parser works similarly to an LR(1) parser. It traces out a rightmost derivation in reverse, maintains a stack, executes shift and reduce actions on a stack, has states that consist of sets of LR items, etc. There are, however, a few major differences:

    To motivate how this works, let's take an example of an LR(2) grammar. (This grammar is derived from one mentioned in @rici's excellent answer to this earlier question).

    S → RS | R

    R → abT

    T → aT | c | ε

    To build our LR(2) parser for this grammar, we'll begin, as usual, by augmenting the grammar with a production of the form S' → S:

    S' → S

    S → RS | R

    R → abT

    T → aT | c | ε

    Now, we start generating configurating sets. We begin, as with an LR(1) parser, with the production S' → S. This is shown here:

    (1)
        S' -> .S  [$$]
    

    Notice here that the lookahead is $$, indicating two copies of the "end-of-stream" marker. In a traditional LR(1) (or SLR(1), or LALR(1)) parser, we'd have a lookahead here of $, just one copy of the end-of-stream marker, because in those parsers the token stream is suffixed with a single $ marker. But now that we’re in LR(2) Land, we have two tokens of lookahead, so our token stream will be sufficed with two end-of-stream markers.

    We now start expanding out the other items in this configurating set. Since we have productions S → RS and S → R, we add these items:

    (1)
        S' -> .S  [$$]
        S  -> .R  [$$]  // New
        S  -> .RS [$$]  // New
    

    Now, let's start chasing down what happens next. As in an LR(1) parser, since there are dots before the nonterminal R's here, we need to expand them out. As in LR(1) parsing, as we do so, we'll need to determine what lookaheads to use. We'll begin by expanding out the S -> .R [$$] item, like this:

    (1)
        S' -> .S   [$$]
        S  -> .R   [$$]
        S  -> .RS  [$$]
        R  -> .abT [$$]  // New
    

    Next, let's expand out the S -> .RS [$$] option. This is an interesting case. We need to determine what the lookahead will be for the R productions discovered here. In an LR(1) parser, this is found by looking at the FIRST set of the remainder of the production. In an LR(2) parser, because we have two tokens of lookahead, we have to look a the FIRST2 set, which is the generalization of FIRST sets that lists the strings of length two that can appear at the front of the production, rather than the strings of length one that can appear there. In our case, FIRST2(S) = {ab} (do you see why?), so we have the following:

    (1)
        S' -> .S   [$$]
        S  -> .R   [$$]
        S  -> .RS  [$$]
        R  -> .abT [$$]
        R  -> .abT [ab]  // New
    

    At this point, we've finished expanding out the first configurating set. It's now time to think about what we would do if we saw different characters next. Fortunately, in this case that's fairly easy, since the first character of any string produced by this grammar has to be an a. So let's see what happens if we encounter an a:

    (2)
        R  -> a.bT [$$]
        R  -> a.bT [ab]
    

    Seems fine so far. Now what happens if we see a b here? That would take us to this spot:

    (3)
        R  -> ab.T [$$]
        R  -> ab.T [ab]
    

    There are two LR(2) items here that have dots before a nonterminal, so we need to expand these out. Let's begin by expanding these for R -> ab.T [$$], giving the following:

    (3)
        R  -> ab.T [$$]
        R  -> ab.T [ab]
        T  -> .aT  [$$]  // New
        T  -> .c   [$$]  // New
        T  -> .    [$$]  // New
    

    Next, we'll expand out the production for R -> ab.T [ab]:

    (3)
        R  -> ab.T [$$]
        R  -> ab.T [ab]
        T  -> .aT  [$$]
        T  -> .c   [$$]
        T  -> .    [$$]
        T  -> .aT  [ab] // New
        T  -> .c   [ab] // New
        T  -> .    [ab] // New
    

    That fills out this configurating set. This is the first time we've found some reduce items that are complete (here, T -> . with the two different lookaheads). We also have some shift items here. So we have to ask - do we have a shift/reduce conflict or a reduce/reduce conflict here?

    Let's begin with reduce/reduce conflicts. As is the case in LR(1) parsing, we have a reduce/reduce conflict when there are two different reduce items (items with the dot at the end) with the same lookaheads. Here, we have two different reduce items, but they have different lookaheads. That means we're fine on the reduce/reduce front.

    Now, the interesting case. Do we have any shift/reduce conflicts here? This is where things change a bit from LR(1) parsing. As is the case in LR(1) parsing, we look at all the shift items in our set (items with a dot before a terminal) and all the reduce items in our set (items with the dot at the end). We're looking to see if there's any conflicts:

        T  -> .aT  [$$] // Shift
        T  -> .c   [$$] // Shift
        T  -> .    [$$] // Reduce
        T  -> .aT  [ab] // Shift
        T  -> .c   [ab] // Shift
        T  -> .    [ab] // Reduce
    

    The question, though, is what a shift/reduce conflict looks like here. In an LR(2) parser, we have two tokens of lookahead on which we base our decision of whether to shift or reduce. In the case of the reduce items, it's easy to see what two tokens of lookahead will lead us to reduce - it's the two-character lookahead in the brackets. On the other hand, consider the shift item T -> .c [ab]. What is the two-character lookahead here in which we'd shift? In the case of an LR(1) parser, we'd just say "oh, the dot's before the c, so we shift on c," but that's not sufficient here. Instead, we'd say that the lookahead associated with this shift item is ca, with the c coming from the production itself and the a coming from the first character of the item's lookahead.

    Similarly, consider the shift item T -> .aT [$$]. We need two characters of lookahead, and we can see one of them pretty easily (the a after the dot). To get the second, we have to look at what T is capable of producing. There are three productions for T: one that replaces T with ε, one that replaces T with aT, and one that replaces T with c. This means that that any string that can be derived from T starts either with a or with c, or is the empty string. As a result, the item T -> .aT [$$] tells us to shift when seeing either ac or aa (what we'd get from the a and the c), or on a$ (what we'd get if we used the production T → ε, then picked up one of the $ characters from the normal lookahead.

    More generally, following this same general procedure - using the terminal(s) after the dot, the terminals in the lookahead set for the item, and the characters that can appear at the front of any strings derivable from future nonterminals - we can compute the two-character lookaheads that we use to determine when to shift. In particular, we're left with this:

    (3)
        R  -> ab.T [$$]
        R  -> ab.T [ab]
        T  -> .aT  [$$] // Shift  on aa, ac, a$
        T  -> .c   [$$] // Shift  on c$
        T  -> .    [$$] // Reduce on $$
        T  -> .aT  [ab] // Shift  on aa, ac
        T  -> .c   [ab] // Shift  on ca
        T  -> .    [ab] // Reduce on ab
    

    Fortunately, there are no shift/reduce conflicts here, because each two-character lookahead tells us to do something different.

    Looking past the dot to determine when to shift is something new to LR(2) parsing that appears in LR(k > 1) parsing but not LR(1) parsing. In LR(1) parsing, we just need to look at the terminal past the dot. In LR(2) parsing, since we need more characters to determine what to do, we have to compute a secondary dot lookahead for each shift item. Specifically, the dot lookahead is determined as follows:

    In a production A → α.tω [γ], where t is a terminal, the dot lookahead is the set of all strings of length 2 that can appear at the start of a string derivable from tωγ. Stated differently, a production A → α.tω [γ] has dot lookahead equal to FIRST2(tωγ).

    With all of this in mind, we can build the full LR(2) parser and describe the actions associated with each state. The overall LR(2) parser looks like this:

    (1)
        S' -> .S   [$$]  // Go to 10
        S  -> .R   [$$]  // Go to 8
        S  -> .RS  [$$]  // Go to 8
        R  -> .abT [$$]  // Shift  on ab, go to (2)
        R  -> .abT [ab]  // Shift  on ab, go to (2)
    
    (2)
        R  -> a.bT [$$]  // Shift  on ba, bc, b$, go to (3)
        R  -> a.bT [ab]  // Shift  on ba, bc,     go to (3)
    
    (3)
        R  -> ab.T [$$] // Go to 7
        R  -> ab.T [ab] // Go to 7
        T  -> .aT  [$$] // Shift  on aa, ac, a$, go to (4)
        T  -> .c   [$$] // Shift  on c$,         go to (5)
        T  -> .    [$$] // Reduce on $$
        T  -> .aT  [ab] // Shift  on aa, ac,     go to (4)
        T  -> .c   [ab] // Shift  on ca,         go to (5)
        T  -> .    [ab] // Reduce on ab
    
    (4)
        T  -> a.T  [$$] // Go to 6
        T  -> a.T  [ab] // Go to 6
        T  -> .    [$$] // Reduce on $$
        T  -> .aT  [$$] // Shift  on aa, ac, a$, go to (4)
        T  -> .c   [$$] // Shift  on c$,         go to (5)
        T  -> .    [ab] // Reduce on ab
        T  -> .aT  [ab] // Shift  on aa, ac,     go to (4)
        T  -> .c   [ab] // Shift  on ca,         go to (5)
    
    (5)
        T  -> c.   [$$] // Reduce on $$
        T  -> c.   [ab] // Reduce on ab
    
    (6)
        T  -> aT.  [$$] // Reduce on $$ 
        T  -> aT.  [ab] // Reduce on ab
    
    (7)
        R  -> abT. [$$] // Reduce on $$
        R  -> abT. [ab] // Reduce on ab
    
    (8)
        S  -> R.   [$$] // Reduce on $$
        S  -> R.S  [$$] // Go to 9
        S  -> .RS  [$$] // Go to 8
        S  -> .R   [$$] // Go to 8
        R  -> .abT [$$] // Shift  on ab, go to (2)
        R  -> .abT [ab] // Shift  on ab, go to (2)
    
    (9)
        S  -> RS.  [$$] // Reduce on $$
    
    (10)
        S' -> S.   [$$] // Accept on $$
    

    So now we have our LR(2) parser for this grammar. All that's left to do now is to encode this as an action and goto table, along the lines of what we do for an LR(1) parser.

    As you might expect, the action table for an LR(2) parser differs from that of the action table for an LR(1) parser in that it's keyed by lookaheads given by two characters rather than just one character. This means that the action table will be much larger for an LR(2) parser than an LR(1) parser. Here's what that looks like here:

     state | aa | ab | ac | a$ | ba | bb | bc | b$ | ca | cb | cc | c$ | $$ 
     ------+----+----+----+----+----+----+----+----+----+----+----+----+----
       1   |    | S  |    |    |    |    |    |    |    |    |    |    |
     ------+----+----+----+----+----+----+----+----+----+----+----+----+----
       2   |    |    |    |    | S  |    | S  | S  |    |    |    |    |
     ------+----+----+----+----+----+----+----+----+----+----+----+----+----
       3   | S  | R  | S  | S  |    |    |    |    | S  |    |    | S  | R
     ------+----+----+----+----+----+----+----+----+----+----+----+----+----
       4   | S  | R  | S  | S  |    |    |    |    | S  |    |    | S  | R
     ------+----+----+----+----+----+----+----+----+----+----+----+----+----
       5   |    | R  |    |    |    |    |    |    |    |    |    |    | R
     ------+----+----+----+----+----+----+----+----+----+----+----+----+----
       6   |    | R  |    |    |    |    |    |    |    |    |    |    | R
     ------+----+----+----+----+----+----+----+----+----+----+----+----+----
       7   |    | R  |    |    |    |    |    |    |    |    |    |    | R
     ------+----+----+----+----+----+----+----+----+----+----+----+----+----
       8   |    | S  |    |    |    |    |    |    |    |    |    |    | R
     ------+----+----+----+----+----+----+----+----+----+----+----+----+----
       9   |    |    |    |    |    |    |    |    |    |    |    |    | R
     ------+----+----+----+----+----+----+----+----+----+----+----+----+----
       10  |    |    |    |    |    |    |    |    |    |    |    |    | A
    

    As you can see, each entry here just says whether to shift or reduce. In practice, each reduce item would be tagged with which production you'd actually do the reduction for, but, um, I was too lazy to type that out.

    In an LR(1) parsing table, you'd typically combine this table with the "goto" table saying where to go after seeing each symbol. That's due to a fortuitous coincidence. In an LR(1) parser, the size of the lookahead is 1, which happens to align with the fact that the goto table says where you're supposed to transition to after seeing the next character. In an LR(2) parser, the decision about whether to shift or reduce depends on two characters of lookahead, but the choice of where to go next after reading one more character of the input only depends on a single character. That is, even though you have two tokens of lookahead to decide whether to do, you only shift one character over at a time. This means that the goto table for an LR(2) parser looks a lot like the goto table for an LR(0) or LR(1) parser, and is shown here:

     state |  a  |  b  |  c  |  $  |  S  |  R  |  T
    -------+-----+-----+-----+-----+-----+-----+-----
       1   |  2  |     |     |     |  10 |  8  |
    -------+-----+-----+-----+-----+-----+-----+-----
       2   |     |  3  |     |     |     |     |
    -------+-----+-----+-----+-----+-----+-----+-----
       3   |  4  |     |  5  |     |     |     |  7
    -------+-----+-----+-----+-----+-----+-----+-----
       4   |  4  |     |  5  |     |     |     |  6
    -------+-----+-----+-----+-----+-----+-----+-----
       5   |     |     |     |     |     |     |
    -------+-----+-----+-----+-----+-----+-----+-----
       6   |     |     |     |     |     |     |
    -------+-----+-----+-----+-----+-----+-----+-----
       7   |     |     |     |     |     |     |
    -------+-----+-----+-----+-----+-----+-----+-----
       8   |  2  |     |     |     |  9  |  8  |
    -------+-----+-----+-----+-----+-----+-----+-----
       9   |     |     |     |     |     |     |
    -------+-----+-----+-----+-----+-----+-----+-----
       10  |     |     |     |     |     |     |
    

    So, to summarize:

    Interestingly, once you know how to build an LR(2) parser, you can generalize the idea to build an LR(k) parser for any k ≥ 2. In particular, all the "new surprises" that arose here are the same ones that you'll see for larger and larger values of k.

    In practice, LR(2) parsers are rarely used due to the size of their action tables and the fact that they generally have way more states than an LR(1) parser due to the increased lookaheads. But it's still worthwhile, IMHO, to see how to they work. It gives you a sense of how LR parsing works more generally.

    Hope this helps!


    A huge thanks to @AProgrammer's answer on cs.stackexchange.com about dot lookaheads versus item lookaheads, which helped me get a better sense of how dot lookaheads function and what their purpose is!

    If you'd like to read the original paper on LR(k) parsing, which specs out the full rules for LR(k) parsing in full detail, check out "On the Translation of Languages from Left to Right" by Don Knuth.