Tags: grammar, context-free-grammar, bnf, ebnf, context-sensitive-grammar

How to deal with large alphabets in BNF?


Given a language defined as:

Any pair of matching symbols is a valid string.

E.g. 00, 55, qq, YY

And a large alphabet of terminal symbols (let's say 4,294,967,296 of them)...

How would you define a BNF grammar to express the language? (Context-sensitive or otherwise.)


I'm specifically interested in learning whether there's a way to do this without writing 4,294,967,296 rules: that is, without producing a grammar so large that it has lost all the benefit of being defined in BNF, having become a "brute force" enumeration of valid literals.



Solution

  • Most uses of BNF are to describe context-free grammars.

    You can certainly use the BNF notation for a non-context-free grammar; all you need to do is put more than one symbol on the left-hand side of a rule. However, that's not generally very useful in practice, because non-context-free grammars do not provide intuitive descriptions of the structure of the language parsed, nor do they lead to an algorithm for parsing the language. And one would expect that any practical grammar formalism would either give human readers a good description, or allow the automatic generation of a parser, or both. (That doesn't make non-context-free grammars useless in the formal analysis of languages; in the mathematical theory, it is not necessary to please either the reader or the parser generator.)
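    As a sketch (not part of the original answer), here is the textbook context-sensitive grammar for { aⁿbⁿcⁿ | n ≥ 1 } written in BNF-like notation; note the left-hand sides containing more than one symbol:

```
<S>     ::= "a" <S> <B> <C> | "a" <B> <C>
<C> <B> ::= <B> <C>
"a" <B> ::= "a" "b"
"b" <B> ::= "b" "b"
"b" <C> ::= "b" "c"
"c" <C> ::= "c" "c"
```

    The notation is perfectly usable, but no standard parser-generation technique consumes rules like `<C> <B> ::= <B> <C>`.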

    But if we restrict ourselves to context-free grammars, we immediately hit a roadblock, because a context-free grammar cannot express duplication, such as { ωω | ω∈Σ* }. Duplication is almost by definition not context-free, because context-free means that the expansion of a non-terminal cannot depend on the context in which the non-terminal appears. Hence, a rule which says "this non-terminal must have the same expansion as that non-terminal", which is required to express duplication, cannot be context-free.
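    To see why duplication requires comparing one part of the string against another, here is a minimal recognizer sketch (my own illustration, not from the answer) for { ωω | ω∈Σ* }; the "same expansion" requirement becomes an equality test between the two halves, which is exactly the kind of cross-reference a context-free rule cannot express:

```python
def is_duplication(s: str) -> bool:
    # A string is in { ww | w in Sigma* } iff it has even length
    # and its second half repeats its first half exactly.
    half, rem = divmod(len(s), 2)
    return rem == 0 and s[:half] == s[half:]

print(is_duplication("abab"))  # True:  "ab" + "ab"
print(is_duplication("abba"))  # False: "ab" != "ba"
```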

    Of course, the language { ωω | ω∈Σ }, which is what you are looking to describe, is context-free, but that's only because it is possible to enumerate all the possibilities (which must be a finite number, because we insist that the alphabet Σ be a finite set).
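    That enumeration is exactly the "brute force" grammar the question hopes to avoid. As an illustration (hypothetical helper name), one could mechanically generate it for any finite alphabet, which makes plain both why the language is context-free and why the resulting grammar is unilluminating at 2³² symbols:

```python
def brute_force_pair_grammar(alphabet):
    # Emit one alternative per symbol: the only way a context-free
    # rule can say "the same symbol, twice" is to list every case.
    alternatives = " | ".join(f'"{c}" "{c}"' for c in alphabet)
    return f"<pair> ::= {alternatives}"

print(brute_force_pair_grammar("05qY"))
# <pair> ::= "0" "0" | "5" "5" | "q" "q" | "Y" "Y"
```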

    So where does that leave you?

    Basically, you are free to invent whatever formalism suits your purposes, as long as you clearly define its meaning for the reader. That formalism may or may not lead to the possibility of automated parser generation, but if that is not your goal, that fact is irrelevant. Most EBNF dialects -- and there are a lot of them, practically none of which can actually generate a parser without assistance -- allow some way of embedding descriptions written in natural language for syntaxes which are difficult or impossible to describe with a context-free grammar. If you look through EBNF examples, you are likely to find a large constellation of different ways of saying "⟨character⟩ is any element of the character set" without actually exhaustively listing the entire character set, which given the existence of Unicode would be a ridiculous undertaking. (Although Unicode only has 17·2¹⁶ codepoints, which is a lot less than 2³². But it's still more than a million.)
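    For instance, the question's language could be sketched in ISO-style EBNF using a "special sequence" (the `? ... ?` escape to natural language) plus a side condition stated in prose; the rule names here are hypothetical:

```
pair   = symbol, symbol ;        (* both occurrences must denote the same symbol *)
symbol = ? any element of the alphabet ? ;
```

    The grammar itself is two lines; the matching constraint lives in the comment, which a human reader understands immediately but no parser generator can act on unaided.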