I'm trying to implement a parser generator (for fun), following the book Engineering a Compiler. The book suggests left-factoring for eliminating backtracking:
A -> B m | B n | m
B -> m
Becomes:
A -> B A' | m
B -> m
A' -> m | n
But now, B
starts with m
, A'
starts with m
, and A
has a production alternative that also starts with m
, hence the start sets for A
, B
and A'
have the common element m
, disqualifying the grammar for a LL(1) parser.
Is the grammar above inherently a non-backtrack free grammar and requires a more powerful parser such as a LR(1) parser? Or am I applying the left-factoring wrong? or there needs to be a prior step to left factoring so that no rule starts with a non-terminal (is it even possible?)
I feel the book is missing some description, in lot of places and this is one example.
In the above grammar, I simplified the issue I'm facing with the following toy grammar:
start -> fn_declaration start | fn_declaration |
fn_call -> ID ( args ) ;
args -> args , arg | arg |
arg -> STR | INT | ID
fn_declaration -> FN ID ( params ) { statements }
params -> param , params | param |
param -> ID ID
statements -> statement , statements | statement |
statement -> declaration | assignment | fn_call | ret
declaration -> ID ID ;
assignment -> ID = expressions ;
expressions -> terms + expressions | terms - expressions | terms
terms -> factor * terms | factor / terms | factor
factor -> ( expressions ) | INT | ID
ret -> RETURN expressions ;
To which I'm applying these steps, in this order:
All steps do their job, but I eventually have these rules:
statement => alt=[ declaration | assignment | ret | Id ( statement_p0 ], first=[Id, return, ;], follow=[,, }]
statement_p0 => alts=[ args ) ; | ) ;], first=[), Int, Str, Id], follow=[,, }]
# ...rest of the rules removed for brevity
declaration
and assignment
share ID
as their first token kind, and fail the is-backtrack-free test.
In order to left-factor the grammar, you need to ensure that every symbol that begins the RHS of a rule for some non-terminal has a disjoint FIRST set from every other symbol. The easiest way to do that is to first left-expand the grammar -- expand every initial non-terminal symbol on a RHS so that the RHS of every rule begins with a terminal.
When you do that you get
A -> m m | m n | m
B -> m
so now you left-factor the A rules to
A -> m A'
A' -> m | n | ε