I've been trying to calculate the follow set of a grammar for some time now, and have run into yet another problem. Here is my follow set calculator:
def gen_follow_set(grammar, start_sym, first_sets):
follow_sets = {nterm: set() for nterm in grammar}
follow_sets[start_sym].add("$")
for _, prods in grammar.items():
for alt in prods:
for item in alt:
if item.isupper():
follow_sets[item] = set()
while True:
changes = copy.deepcopy(follow_sets)
for nterm, prods in grammar.items():
for alt in prods:
for i, item in enumerate(alt):
la = alt[i + 1] if i + 1 != len(alt) else nterm
if i == len(alt) - 1 and item != "":
follow_sets[item] |= follow_sets[nterm]
elif item != "":
if "" in first_sets[la]:
follow_sets[item] |= first_sets[la].union(
first_sets[alt[i + 2] if i + 2 <= len(alt) -
1 else nterm]) - {""}
else:
follow_sets[item] |= first_sets[la]
if changes == follow_sets:
return follow_sets
This is called like this:
grammar = {
"expr": [["term", "etail"]],
"term": [["LPAREN", "expr", "RPAREN"], ["INT", "ttail"]],
"etail": [["PLUS", "expr"], [""]],
"ttail": [["TIMES", "term"], [""]]
}
first = calc_first_set(...)
pprint.pprint(gen_follow_set(grammar, "expr", first))
This outputs:
Working on: term ; la has epsilon; la: etail
Working on: etail ; it is at the end of the production: expr
Working on: LPAREN ; la doesn't have epsilon; la: expr
Working on: expr ; la doesn't have epsilon; la: RPAREN
Working on: RPAREN ; it is at the end of the production: term
Working on: INT ; la has epsilon; la: ttail
Working on: ttail ; it is at the end of the production: term
Working on: PLUS ; la doesn't have epsilon; la: expr
Working on: expr ; it is at the end of the production: etail
Working on: TIMES ; la doesn't have epsilon; la: term
Working on: term ; it is at the end of the production: ttail
Working on: term ; la has epsilon; la: etail
Working on: etail ; it is at the end of the production: expr
Working on: LPAREN ; la doesn't have epsilon; la: expr
Working on: expr ; la doesn't have epsilon; la: RPAREN
Working on: RPAREN ; it is at the end of the production: term
Working on: INT ; la has epsilon; la: ttail
Working on: ttail ; it is at the end of the production: term
Working on: PLUS ; la doesn't have epsilon; la: expr
Working on: expr ; it is at the end of the production: etail
Working on: TIMES ; la doesn't have epsilon; la: term
Working on: term ; it is at the end of the production: ttail
Working on: term ; la has epsilon; la: etail
Working on: etail ; it is at the end of the production: expr
Working on: LPAREN ; la doesn't have epsilon; la: expr
Working on: expr ; la doesn't have epsilon; la: RPAREN
Working on: RPAREN ; it is at the end of the production: term
Working on: INT ; la has epsilon; la: ttail
Working on: ttail ; it is at the end of the production: term
Working on: PLUS ; la doesn't have epsilon; la: expr
Working on: expr ; it is at the end of the production: etail
Working on: TIMES ; la doesn't have epsilon; la: term
Working on: term ; it is at the end of the production: ttail
{'INT': {'INT', 'TIMES', 'LPAREN'},
'LPAREN': {'INT', 'LPAREN'},
'PLUS': {'INT', 'LPAREN'},
'RPAREN': {'INT', 'LPAREN', 'PLUS'},
'TIMES': {'INT', 'LPAREN'},
'etail': {'$', 'RPAREN'},
'expr': {'$', 'RPAREN'},
'term': {'INT', 'LPAREN', 'PLUS'},
'ttail': {'INT', 'LPAREN', 'PLUS'}}
etail
and expr
are correct, but term
and ttail
aren't correct. How can I make the get the correct answer?
Whenever the non-terminal N
appears in a production
M → α N β
We have
FIRST(α) ⊂ FOLLOW(N)
If β
is nullable, then FOLLOW(M) ⊂ FOLLOW(N)
Your code works correctly if β is empty (i.e. N
is at the end of the production) or if the first symbol in β is not nullable. In the remaining cases, your code has errors:
If the first symbol in β is nullable, you compute FIRST(β) as the union of the FIRST sets of the first two symbols in β. Since you never check whether the second (or subsequent) symbols are nullable, you might miss symbols in FIRST(β).
Another result of only testing nullability of the next symbol is that you don't compute NULLABLE(β); instead you use the nullability of the first symbol in β. So you might miss symbols in FOLLOW(M)
.
I don't believe either of those bugs are triggered by your actual grammar. But the next one is;
In the case that your (insufficient) test reveals that β is nullable, you use FIRST(M)
instead of FOLLOW(M)
.
A closely related problem is the computation of la
which proposes term
as the next symbol if the end of the production has been reached. That would lead to using FIRST(term)
rather than FOLLOW(term)
, but of course that ca n never happen since the only code branch which uses la
does not execute if N
is at the end of the production. That being the case, la
is actually unnecessary.