Suppose I have a grammar
S -> a b | a T
T -> a
Clearly, the grammar accepts {aa, ab}. But I am confused how LR parsing parses the word "a b". Naively, it could work like this, by reducing the first "a" to T and then it gets stuck or has to backtrack.
a a => T a => stuck or backtrack
How would an LR parser knows that it should NOT reduce the first "a" to T?
In this case, the parser will never try reducing T
because the production T → a
is not in any state reached from S
. The initial state has items:
S → • a b
S → • a T
and the only action possible in this state is a shift action with the token a
. Since a
is, in fact, the next input character, we do a shift transition to a state whose itemset is
S → a • b
S → a • T
T → • a
This state has no reduce actions either, and it has two distinct shift actions: one on b
and the other one on a
. Since the next input is b
, that shift action is taken, leading to a state whose itemset is
S → a b •
in which only a reduction is available.
A slightly more interesting case would be the rather similar grammar
S → a b
S → T a
T → a
Here, the itemset for the initial state does include the production for T
:
S → • a b
S → • T a
T → • a
It's still the case that the only action available in the initial state is to shift a
, but now after doing the shift we find ourselves in a state whose itemset is:
S → a • b
T → a •
and now we have two possible actions: a shift of b
and a reduction of T → a
. Here, the parser needs to use its ability to look one token into the future (assuming that its an LR(1)
parser).
But to let it do so, we need to do a small adjustment. Grammars are always "augmented" before the parsing automaton is constructed. The augmented grammar adds explicit recognition of the end of input by adding a unique end-of-input character which can also participate in lookahead checks. The augmented grammar is:
S'→ S $
S → a b
S → T a
T → a
In this grammar, we can see that the nonterminal T
can only be followed by the symbol a
, and this fact is encoded into the state transition tables, where each item in an itemset is actually annotated with a set of possible lookaheads. (During table construction, all items are annotated, but for the computation of legal actions, only the lookaheads for reductions are considered; a reduction is an item whose • is at the end.)
With this modification, the itemset reached after shifting a
is:
S → a • b
T → a • [ lookahead: { a } ]
and the lookahead makes it absolutely clear which action should be chosen. If the next input is b
, it is shifted. If the next input is a
, T
is reduced.
That precision is not available with LR(0) parsing, in which lookahead is not used. The modified grammar is not LR(0), precisely for the reason you mention: that it cannot decide whether or not to reduce T
. This issues comes up pretty frequently, which is why LR(0) is rarely if ever useful in practice.