
What, exactly, is the total number of characters in C's "basic execution character set"?


The following pages on cppreference.com seem jointly inconsistent:

  1. https://en.cppreference.com/w/c/language/charset

Basic character set is also known as basic source character set. The basic execution character set contains all the members of the basic character set, plus the following characters: U+0000 Null U+0007 Bell U+0008 Backspace U+000A Line feed (LF) U+000D Carriage return (CR)

  2. https://en.cppreference.com/w/c/language/memory_model

A byte is the smallest addressable unit of memory. It is defined as a contiguous sequence of bits, large enough to hold any member of the basic execution character set (the 96 characters that are required to be single-byte). C supports bytes of sizes 8 bits and greater.

  3. https://en.cppreference.com/w/c/language/translation_phases

The source character set is a multibyte character set which includes the basic source character set as a single-byte subset, consisting of the following 96 characters: ...

... as long as all 96 characters from the basic source character set listed in phase 1 have single-byte representations

It's not logically possible for both the basic execution character set (BECS) and the basic source character set (BSCS) to have 96 characters and for the BECS to contain all members of the BSCS plus 5 control characters.


The latest C23 draft is here: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3096.pdf

The latest draft of C17 is here: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2310.pdf

The most relevant passages can be located by searching (Ctrl+F) for the string "Both the basic source and basic execution character sets" in these PDF files.

I get the impression from reading these that the BECS is a proper superset of BSCS, just as stated on cppreference's charset page. But I hesitate to conclude, as I seem forced to do, that cppreference (which I find normally very reliable) is wrong both on the memory model page and on the translation phase page. So, what's going on?


Solution

  • According to the C23 draft N3096, the basic character set has 95 members, not counting new-line. (The later draft N3219 may include 3 additional characters.[1])

    The basic execution character set includes the 95 characters of the basic character set, the null character, an alert character, a backspace character, a new-line character, and a carriage return character, bringing the total to 100.

    The basic source character set includes the basic character set and the new-line character that is used to replace the end-of-line indicators in translation phase 1, bringing the total to 96.

    The basic execution character set is a proper superset of the basic source character set (although the actual character codes may differ). The members of the basic character set and the new-line character are present in both, but the basic execution character set includes 4 characters that are not present in the basic source character set.

    Regarding the pages on cppreference.com:

    1. https://en.cppreference.com/w/c/language/charset

      1. The "Code unit" columns show the ASCII character codes, but these have nothing to do with the C standard. The horizontal tab character is listed as "Character tabulation", and the vertical tab character is listed as "Line tabulation".

      2. The paragraph:

        Basic character set is also known as basic source character set.

        ignores the new-line character that becomes part of the basic source character set during translation phase 1.

      3. The new-line character is listed as "Line feed (LF)" in the list of members of the basic execution character set. (In ASCII-based execution character sets, C's new-line character is usually (always?) mapped to ASCII LF.) The alert character is listed as "Bell".

    2. https://en.cppreference.com/w/c/language/memory_model

      1. The phrase:

        the basic execution character set (the 96 characters that are required to be single-byte)

        does not take into account the null, alert, backspace, and carriage-return characters, but does take into account the new-line character. (As of C23, the characters @, $, and ` are also required to have a single-byte encoding. They are extended characters, not part of the basic execution character set, but are required to be present.)

    3. https://en.cppreference.com/w/c/language/translation_phases

      1. The description of the 96 characters of the basic source character set includes the new-line character that is used to replace the end-of-line indicators. (As of C23, the characters @, $, and ` are also required to have a single-byte encoding. They are extended characters, not part of the basic source character set, but are required to be present.)

    Indications are that the version of the C standard following C23[1], code-named C2Y, will include the @, $, and ` characters as part of the basic character set. This will increase the number of members of the basic character set to 98, the number of members of the basic source character set to 99 (including the new-line character that replaces end-of-line indicators in translation phase 1), and the number of members of the basic execution character set to 103. (It is possible that this has already happened in C23.[1])


    [1] It seems highly likely that this change has made it into the latest C23 draft (N3219), but since that draft is not publicly available (it is only downloadable as a password-protected zip file), I have not confirmed it. However, the C2Y draft (N3220) does have the change, and according to the editor's report N3221, the only change between N3219 and N3220 is an editorial change to a footnote in Annex K.