
Unicode version of ABNF?


I want to write a grammar for a file format whose content can contain characters outside US-ASCII. Since I am used to ABNF, I am trying to use it.

However, neither RFC 5234 nor RFC 7405 is very friendly towards people who do NOT use US-ASCII.

What I'm looking for is a version of ABNF (and possibly a set of core rules as well) that is character oriented rather than byte oriented. The only thing RFC 5234 has to say about this is in section 2.4:

2.4.  External Encodings

   External representations of terminal value characters will vary
   according to constraints in the storage or transmission environment.
   Hence, the same ABNF-based grammar may have multiple external
   encodings, such as one for a 7-bit US-ASCII environment, another for
   a binary octet environment, and still a different one when 16-bit
   Unicode is used.  Encoding details are beyond the scope of ABNF,
   although Appendix B provides definitions for a 7-bit US-ASCII
   environment as has been common to much of the Internet.

   By separating external encoding from the syntax, it is intended that
   alternate encoding environments can be used for the same syntax.

That doesn't really clarify matters.

Is there a version of ABNF somewhere which is code point oriented rather than byte oriented?


Solution

  • If the ABNF you're writing is intended for human reading, then I'd say just use the normal syntax and refer to code points instead of bytes. You could take a look at various language specifications that allow Unicode in source text, e.g. C#, Java, PowerShell, etc. They all have a grammar, and they all have to define Unicode characters somewhere (e.g. for identifiers).

    E.g. the PowerShell grammar has lines like this:

    double-quote-character:
           " (U+0022)
           Left double quotation mark (U+201C)
           Right double quotation mark (U+201D)
           Double low-9 quotation mark (U+201E)

    Or in the Java specification:

    UnicodeInputCharacter:
           UnicodeEscape
           RawInputCharacter

    UnicodeEscape:
           \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit

    UnicodeMarker:
           u
           UnicodeMarker u

    RawInputCharacter:
           any Unicode character

    HexDigit: one of
           0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F

    The \, u, and hexadecimal digits here are all ASCII characters.

    Note that there is surrounding text explaining the intent – which is always better than just dumping a heap of grammar on someone.

    If it's for automatic parser generation, you may be better off finding a tool that accepts a Unicode-aware grammar in an ABNF-like form, and publishing the grammar in that tool's notation instead. People writing parsers should be expected to understand either, though.
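
To make the "code points instead of bytes" reading concrete, here is a minimal Python sketch. It assumes that an ABNF numeric terminal such as `%x201C` is read as the code point U+201C rather than as an octet value; this is an interpretation you would need to state in the prose around your grammar, not something the RFC mandates. The names `code_point_range`, `matches`, and `DOUBLE_QUOTE_CHARACTER` are made up for this illustration, and the rule is just the PowerShell example above rewritten with ABNF-style terminals.

```python
import re
from typing import Iterable

# Hypothetical helper, not part of any ABNF tool: reads a hex terminal such
# as "%x22" or "%x201C-201E" as an inclusive range of Unicode code points.
_HEX_TERMINAL = re.compile(r"%x([0-9A-Fa-f]+)(?:-([0-9A-Fa-f]+))?\Z")


def code_point_range(terminal: str) -> range:
    """Return the code points covered by one ABNF-style hex terminal."""
    m = _HEX_TERMINAL.match(terminal)
    if m is None:
        raise ValueError(f"not a hex terminal: {terminal!r}")
    low = int(m.group(1), 16)
    high = int(m.group(2), 16) if m.group(2) else low
    if low > high or high > 0x10FFFF:
        raise ValueError(f"invalid code point range: {terminal!r}")
    return range(low, high + 1)


def matches(char: str, alternatives: Iterable[str]) -> bool:
    """True if the single character `char` is covered by any alternative."""
    cp = ord(char)  # a Python str is a sequence of code points, not bytes
    return any(cp in code_point_range(t) for t in alternatives)


# Assumed ABNF spelling of the PowerShell rule quoted above:
#   double-quote-character = %x22 / %x201C-201E
DOUBLE_QUOTE_CHARACTER = ["%x22", "%x201C-201E"]
```

With this reading, `matches("\u201c", DOUBLE_QUOTE_CHARACTER)` is true even though U+201C takes three bytes in UTF-8, because the comparison happens on the decoded character and the encoding stays outside the grammar, which is the separation section 2.4 describes.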