parsinggrammarcontext-free-grammarbnf

Can this language be described by a non-ambiguous BNF grammar?


I'm getting back into language design/specification (via BNF/EBNF grammars) after 20+ years of leaving the topic alone (since my CS undergrad degree).

I only vaguely recall the various related terms in this space, like LR(1), LALR, etc. I've been attempting to refresh via some googling and reading, but it's coming slowly (probably because I didn't fully understand this stuff back in school). So I'm probably doing things quite roughly.

I decided I would describe a toy language with a grammar, and then try to analyze and possibly optimize it, as a part of my re-learning.

NOTE: all the snippets below can also be found in a gist here.

I started with an EBNF representation (as processed/validated by this tool):

Program                 := WhSp* (StmtSemi WhSp*)* StmtSemiOpt? WhSp*;
Stmt                    := AStmt | BStmt | CStmt | DStmt;
StmtSemi                := Stmt? (WhSp* ";")+;
StmtSemiOpt             := Stmt? (WhSp* ";")*;
WhSp                    := "_";
AStmt                   := "a";
BStmt                   := "b";
CStmt                   := "c";
DStmt                   := "d";

Here are some valid matches for this language (one match per line):


_____
;;;;;
_;_;_
a
__a__
a;
a;b;
a;_b;
_a;_b;_
_a_;_b_;_
__a__;;
_;_a;_b;c;;;__;;__d;___a___

And here are some values that wouldn't be in the language (again, one per line):

ab
a_b
a;_b_c

I then hand-converted this to the following BNF form (as processed/analyzed by this tool):

Program                 -> StmtSemi FinalStmtSemiOpt .
StmtSemi                -> WhSp StmtSemiOpt | StmtSemiOpt .
FinalStmtSemiOpt        -> StmtOpt SemiOpt WhSpOpt | WhSpOpt .

Stmt                    -> AStmt | BStmt | CStmt | DStmt .
StmtOpt                 -> Stmt | .
StmtSemiOpt             -> StmtOpt Semi | StmtOpt Semi WhSpOpt StmtSemiOpt | .

Semi                    -> WhSpOpt ; | WhSpOpt ; Semi .
SemiOpt                 -> Semi | .

WhSp                    -> _ | _ WhSp .
WhSpOpt                 -> WhSp | .

AStmt                   -> a .
BStmt                   -> b .
CStmt                   -> c .
DStmt                   -> d .

That tool's analysis says my grammar is ambiguous. I guess that's not surprising or necessarily a bad outcome, but I know ambiguous grammars limit some kinds of analysis and automatic conversion or parser generation.

So... finally, here are my questions:

  1. Is this a context-free grammar? What specifically makes it so, or would make it non-CFG?

    [Edit: "Yes", see @rici's answer]

  2. Can the language I'm describing be specified in a non-ambiguous grammar (BNF or EBNF)? Or is it just inherently ambiguous?

  3. If it's inherently ambiguous, what specific aspects of the language make it so? In other words, what minimally would I have to change/remove to arrive at a language that had a non-ambiguous grammar?

  4. Are there meaningful ways my BNF grammar form could be simplified and still describe the same language as the EBNF?

  5. Does the BNF grammar currently have left-recursion, right-recursion, or both? I'm having trouble convincing myself of the answer. Could the BNF be re-arranged to avoid one or the other, and what would the impacts (performance, etc) of that be?

    [Edit: I believe the updated BNF only has right-recursion, according to the analysis tool.]

Apologies if I'm fumbling around with incorrect terminology or asking imprecise questions. Thanks for any insight you might be able to offer.


[EDIT: Here's a new BNF that I believe is equivalent but isn't ambiguous -- thanks to @rici for confirming it was possible. I didn't use any particular algorithm/strategy for this, just keep trial-n-error fiddling.]

Leading                 -> WhSp Leading Program | Semi Leading Program | Program .
Program                 -> Stmt | Stmt WhSp | Stmt WhSpOptSemi Program | Stmt WhSpOptSemi WhSp Program | .

Stmt                    -> AStmt | BStmt | CStmt | DStmt .

WhSpOptSemi             -> Semi | WhSp Semi | Semi WhSpOptSemi | WhSp Semi WhSpOptSemi .
WhSp                    -> _ | _ WhSp .
Semi                    -> ; .

AStmt                   -> a .
BStmt                   -> b .
CStmt                   -> c .
DStmt                   -> d .

So that seems to answer questions (2), (3), and partly (4) I guess.


Solution

  • It's not always easy (or even possible) to demonstrate that a grammar is ambiguous, but if there is a short ambiguous sentence, then it can be found with brute-force enumeration, which is what I believe that tool does. And the output is revealing; the shortest ambiguous sentence is the empty string.

    So it remains only to figure out why the empty string can be parsed in two (or more) ways, and a quick look at the productions reveals that FinalStmtSemiOpt has two nullable productions, which means that it has two ways of deriving the empty string. That's evident by inspection, if you believe that every production whose name ends with Opt in fact describes an optional sequence, since FinalStmtSemiOpt has two productions each of which consist only of XOpts. A little more effort is required to verify that the optional non-terminals are, in fact, optional, which is as it should be.

    So the solution is to rework FinalStmtSemiOpt without using two nullable right-hand sides. That shouldn't be too hard.

    Almost all the other questions you raise are answerable by inspection: