parsinggrammarcontext-free-grammarlr-grammar

How does LR parsing parse the word "a b" in this grammar: S -> a b | a T ; T -> a


Suppose I have a grammar

S -> a b | a T
T -> a

Clearly, the grammar accepts {aa, ab}. But I am confused how LR parsing parses the word "a b". Naively, it could work like this, by reducing the first "a" to T and then it gets stuck or has to backtrack.

a a => T a => stuck or backtrack

How would an LR parser knows that it should NOT reduce the first "a" to T?


Solution

  • In this case, the parser will never try reducing T because the production T → a is not in any state reached from S. The initial state has items:

    S → • a b
    S → • a T
    

    and the only action possible in this state is a shift action with the token a. Since a is, in fact, the next input character, we do a shift transition to a state whose itemset is

    S → a • b
    S → a • T
    T → • a
    

    This state has no reduce actions either, and it has two distinct shift actions: one on b and the other one on a. Since the next input is b, that shift action is taken, leading to a state whose itemset is

    S → a b •
    

    in which only a reduction is available.

    A slightly more interesting case would be the rather similar grammar

    S → a b
    S → T a
    T → a
    

    Here, the itemset for the initial state does include the production for T:

    S → • a b
    S → • T a
    T → • a
    

    It's still the case that the only action available in the initial state is to shift a, but now after doing the shift we find ourselves in a state whose itemset is:

    S → a • b
    T → a •
    

    and now we have two possible actions: a shift of b and a reduction of T → a. Here, the parser needs to use its ability to look one token into the future (assuming that its an LR(1) parser).

    But to let it do so, we need to do a small adjustment. Grammars are always "augmented" before the parsing automaton is constructed. The augmented grammar adds explicit recognition of the end of input by adding a unique end-of-input character which can also participate in lookahead checks. The augmented grammar is:

    S'→ S $
    S → a b
    S → T a
    T → a
    

    In this grammar, we can see that the nonterminal T can only be followed by the symbol a, and this fact is encoded into the state transition tables, where each item in an itemset is actually annotated with a set of possible lookaheads. (During table construction, all items are annotated, but for the computation of legal actions, only the lookaheads for reductions are considered; a reduction is an item whose • is at the end.)

    With this modification, the itemset reached after shifting a is:

    S → a • b
    T → a •    [ lookahead: { a } ]
    

    and the lookahead makes it absolutely clear which action should be chosen. If the next input is b, it is shifted. If the next input is a, T is reduced.

    That precision is not available with LR(0) parsing, in which lookahead is not used. The modified grammar is not LR(0), precisely for the reason you mention: that it cannot decide whether or not to reduce T. This issues comes up pretty frequently, which is why LR(0) is rarely if ever useful in practice.