For the given context free grammar:
S -> G $
G -> PG | P
P -> id : R
R -> id R | epsilon
How do I rewrite the grammar so that it is LR(1)?
The current grammar has shift/reduce conflicts when parsing the input "id : .id", where "." is the input pointer for the parser.
This grammar produces the language satisfying the regular expression (id:(id)*)+
It's easy enough to produce an LR(1) grammar for the same language. The trick is finding one which has a similar parse tree, or at least from which the original parse tree can be recovered easily.
Here's a manually generated grammar, which is slightly simplified from the general algorithm. In effect, we rewrite the regular expression:
(id:id*)+
to:
id(:id+)*:id*
which induces the grammar:
S → id G $
G → P G | P'
P' → : R'
P → : R
R' → ε | id R'
R → ε | id R
which is LALR(1).
In effect, we've just shifted all the productions one token to the right, and there is a general algorithm which can be used to create an LR(1)
grammar from an LR(k+1)
grammar for any k≥1
. (The version of this algorithm I'm using comes from Parsing Theory by S. Sippu & E. Soisalon-Soininen, Vol II, section 6.7.)
The non-terminals of the new grammar will have the form (x, V, y)
where V
is a symbol from the original grammar (either a terminal or a non-terminal) and x
and y
are terminal sequences of maximum length k
such that:
y ∈ FOLLOWk(V)
x ∈ FIRSTk(Vy)
(The lengths of y
and consequently x
might be less than k
if the end of input is included in the follow set. Some people avoid this issue by adding k
end symbols, but I think this version is just as simple.)
A non-terminal (x, V, y)
will generate the x
-derivative of the strings derived from Vy
from the original grammar. Informally, the entire grammar is shifted k
tokens to the right; each non-terminal matches a string which is missing the first k
tokens but is augmented with the following k
tokens.
The productions are generated mechanically from the original productions. First, we add a new start symbol, S'
with productions:
S' → x (x, S, ε)
for every x ∈ FIRSTk(S)
. Then, for every production
T → V0 V1 … Vm
we generate the set of productions:
(x0,T,xm+1) → (x0,V0,x1) (x1,V1,x2) … (xm,Vm,xm+1)
and for every terminal A
we generate the set of productions
(Ax,A,xB) → B if |x| = k
(Ax,A,x) → ε if |x| ≤ k
Since there is an obvious homomorphism from the productions in the new grammar to the productions in the old grammar, we can directly create the original parse tree, although we need to play some tricks with the semantic values in order to correctly attach them to the parse tree.