I'm getting back into language design/specification (via BNF/EBNF grammars) after 20+ years of leaving the topic alone (since my CS undergrad degree).
I only vaguely recall the various related terms in this space, like LR(1), LALR, etc. I've been attempting to refresh via some googling and reading, but it's coming slowly (probably because I didn't fully understand this stuff back in school). So I'm probably doing things quite roughly.
I decided I would describe a toy language with a grammar, and then try to analyze and possibly optimize it, as a part of my re-learning.
NOTE: all the snippets below can also be found in a gist here.
I started with an EBNF representation (as processed/validated by this tool):
Program := WhSp* (StmtSemi WhSp*)* StmtSemiOpt? WhSp*;
Stmt := AStmt | BStmt | CStmt | DStmt;
StmtSemi := Stmt? (WhSp* ";")+;
StmtSemiOpt := Stmt? (WhSp* ";")*;
WhSp := "_";
AStmt := "a";
BStmt := "b";
CStmt := "c";
DStmt := "d";
Here are some valid matches for this language (one match per line):
_____
;;;;;
_;_;_
a
__a__
a;
a;b;
a;_b;
_a;_b;_
_a_;_b_;_
__a__;;
_;_a;_b;c;;;__;;__d;___a___
And here are some values that wouldn't be in the language (again, one per line):
ab
a_b
a;_b_c
I then hand-converted this to the following BNF form (as processed/analyzed by this tool):
Program -> StmtSemi FinalStmtSemiOpt .
StmtSemi -> WhSp StmtSemiOpt | StmtSemiOpt .
FinalStmtSemiOpt -> StmtOpt SemiOpt WhSpOpt | WhSpOpt .
Stmt -> AStmt | BStmt | CStmt | DStmt .
StmtOpt -> Stmt | .
StmtSemiOpt -> StmtOpt Semi | StmtOpt Semi WhSpOpt StmtSemiOpt | .
Semi -> WhSpOpt ; | WhSpOpt ; Semi .
SemiOpt -> Semi | .
WhSp -> _ | _ WhSp .
WhSpOpt -> WhSp | .
AStmt -> a .
BStmt -> b .
CStmt -> c .
DStmt -> d .
That tool's analysis says my grammar is ambiguous. I guess that's not surprising or necessarily a bad outcome, but I know ambiguous grammars limit some kinds of analysis and automatic conversion or parser generation.
So... finally, here are my questions:
Is this a context-free grammar? What specifically makes it so, or would make it non-CFG?
[Edit: "Yes", see @rici's answer]
Can the language I'm describing be specified in a non-ambiguous grammar (BNF or EBNF)? Or is it just inherently ambiguous?
If it's inherently ambiguous, what specific aspects of the language make it so? In other words, what minimally would I have to change/remove to arrive at a language that had a non-ambiguous grammar?
Are there meaningful ways my BNF grammar form could be simplified and still describe the same language as the EBNF?
Does the BNF grammar currently have left-recursion, right-recursion, or both? I'm having trouble convincing myself of the answer. Could the BNF be re-arranged to avoid one or the other, and what would the impacts (performance, etc) of that be?
[Edit: I believe the updated BNF only has right-recursion, according to the analysis tool.]
Apologies if I'm fumbling around with incorrect terminology or asking imprecise questions. Thanks for any insight you might be able to offer.
[EDIT: Here's a new BNF that I believe is equivalent but isn't ambiguous -- thanks to @rici for confirming it was possible. I didn't use any particular algorithm/strategy for this, just keep trial-n-error fiddling.]
Leading -> WhSp Leading Program | Semi Leading Program | Program .
Program -> Stmt | Stmt WhSp | Stmt WhSpOptSemi Program | Stmt WhSpOptSemi WhSp Program | .
Stmt -> AStmt | BStmt | CStmt | DStmt .
WhSpOptSemi -> Semi | WhSp Semi | Semi WhSpOptSemi | WhSp Semi WhSpOptSemi .
WhSp -> _ | _ WhSp .
Semi -> ; .
AStmt -> a .
BStmt -> b .
CStmt -> c .
DStmt -> d .
So that seems to answer questions (2), (3), and partly (4) I guess.
It's not always easy (or even possible) to demonstrate that a grammar is ambiguous, but if there is a short ambiguous sentence, then it can be found with brute-force enumeration, which is what I believe that tool does. And the output is revealing; the shortest ambiguous sentence is the empty string.
So it remains only to figure out why the empty string can be parsed in two (or more) ways, and a quick look at the productions reveals that FinalStmtSemiOpt
has two nullable productions, which means that it has two ways of deriving the empty string. That's evident by inspection, if you believe that every production whose name ends with Opt
in fact describes an optional sequence, since FinalStmtSemiOpt
has two productions each of which consist only of XOpt
s. A little more effort is required to verify that the optional non-terminals are, in fact, optional, which is as it should be.
So the solution is to rework FinalStmtSemiOpt
without using two nullable right-hand sides. That shouldn't be too hard.
Almost all the other questions you raise are answerable by inspection:
A grammar is context-free (by definition) iff every production has precisely one symbol on the left hand side. (If it had more than one symbol, then the extra symbols define a context in which the substitution is valid. If no production has a context, the grammar is context-free.)
A non-terminal is left-recursive if some production for the non-terminal either starts with the non-terminal itself (direct left-recursion), or starts with a sequence of nullable non-terminals followed by the non-terminal itself (indirect left-recursion). In other words, the non-terminal is or could be the left-most symbol in a production. Similarly, a non-terminal is right-recursive if some production ends with the non-terminal (again, possibly followed by nullable non-terminals).
There is no algorithm which can tell you in general if a language is inherently ambiguous, but in the particular case that a language is regular, then it is definitely not inherently ambiguous. A language with a regular grammar is regular, but a non-regular grammar could still describe a regular language. Your grammar is not regular, but it can be made regular without too much work (mostly refactoring). A hint that it is probably regular is that there is no recursively nested syntax (eg. parentheses or statement blocks). Most useful grammars are recursively nested, and nesting in no way implies ambiguity. So that doesn't take you too far, but it happens to be the case here.