c language-lawyer trigraphs

Meaning of character literals containing trigraphs for non-representable characters


On a C compiler which uses ASCII as its character set, the value of the character literal '??<' would be equivalent to that of '{', i.e. 0x7B. What would be the value of that literal on a compiler whose character set doesn't have a { character?

Outside a string literal, a compiler could infer that ??< is supposed to have the same meaning as an open-brace character, even if the compiler's character set doesn't have one. Indeed, the whole purpose of trigraphs is to allow sequences of representable characters to be used in place of characters that aren't representable. The spec requires that trigraphs be processed even within string literals, however, which has me puzzled. If a compiler's character set includes a { character, the compiler can allow '{' to be written as '??<', but if the character set includes { I see no reason a programmer wouldn't simply use that. If the character set doesn't include {, however, which would seem the only reason for using trigraphs in the first place, what representable character would a compiler be expected to replace ??< with?


Solution

  • When it comes to considerations about the environment, especially files, the C standard intentionally becomes rather vague. The following guarantees are made about trigraphs and the encoding of their corresponding characters:

    C11 (n1570) 5.1.1.2 p1 (“Translation phases”) [emph. mine]

    1. Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.

    Thus, the trigraph sequence must be mapped to a single byte. This single-byte character must be a member of the basic character set and distinct from every other member of that set. How the compiler handles it internally during translation isn’t observable behaviour, so it’s irrelevant.

    If written to a text stream, it may be converted (as I read it, possibly back to a trigraph sequence if the underlying encoding has no representation for that character). It can be read back again, and must compare equal if it is considered a printing character. Ibid. 7.21.2 p2:

    […] Data read in from a text stream will necessarily compare equal to the data that were earlier written out to that stream only if: the data consist only of printing characters and the control characters horizontal tab and new-line; no new-line character is immediately preceded by space characters; and the last character is a new-line character. […]

    Ibid. 7.4 p3:

    The term printing character refers to a member of a locale-specific set of characters, each of which occupies one printing position on a display device; the term control character refers to a member of a locale-specific set of characters that are not printing characters.*) All letters and digits are printing characters.

    *) In an implementation that uses the seven-bit US ASCII character set, the printing characters are those whose values lie from 0x20 (space) through 0x7E (tilde); the control characters are those whose values lie from 0 (NUL) through 0x1F (US), and the character 0x7F (DEL).

    And for binary streams, ibid. 7.21.2 p3:

    A binary stream is an ordered sequence of characters that can transparently record internal data. Data read in from a binary stream shall compare equal to the data that were earlier written out to that stream, under the same implementation. Such a stream may, however, have an implementation-defined number of null characters appended to the end of the stream.
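
    A minimal sketch of that binary-stream guarantee, assuming a hosted implementation where tmpfile() succeeds. The helper name roundtrip_binary is hypothetical. It uses '{' rather than the trigraph spelling so that it compiles the same way whether or not the compiler has trigraph processing enabled.

```c
#include <stdio.h>

/* Hypothetical helper: writes one character to a temporary binary
   stream, rewinds, and reads it back. Returns the character read,
   or EOF if any I/O step fails. */
int roundtrip_binary(int ch)
{
    FILE *fp = tmpfile();          /* opened in "wb+" (binary update) mode */
    int result = EOF;

    if (fp == NULL)
        return EOF;

    /* A positioning call is required between output and input. */
    if (fputc(ch, fp) != EOF && fseek(fp, 0L, SEEK_SET) == 0)
        result = fgetc(fp);

    fclose(fp);                    /* the temporary file is removed here */
    return result;
}
```

    Per 7.21.2 p3, roundtrip_binary('{') should compare equal to '{' on any conforming implementation, whatever byte value the implementation assigns to that character.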

    In the comments above, the question arose whether

    printf("int main(void) ??< ??>\n");     // (1)
    printf("int main(void) ?\?< ?\?>\n");   // (2)

    always works for code generation, i.e. whether the output of these statements is guaranteed to be compilable. I couldn’t find a normative reference requiring isprint('??<') etc. (for (1)) or even isprint('<') etc. (for (2)) to return non-zero, but the C89 rationale about streams says:

    The set of characters required to be preserved in text stream I/O are those needed for writing C programs; the intent is the Standard should permit a C translator to be written in a maximally portable fashion. Control characters such as backspace are not required for this purpose, so their handling in text streams is not mandated.
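
    For variant (2), what ends up in the string can be checked directly. This is a sketch with made-up helper names (escaped_open, is_three_plain_chars): because the \? escape denotes an ordinary question mark, the source never contains two adjacent question marks followed by '<', so the literal holds three separate characters whether or not trigraph replacement is performed.

```c
#include <string.h>

/* Hypothetical helper: the literal from variant (2). The \? escape
   keeps the two question marks from being adjacent in the source,
   so phase 1 trigraph replacement cannot apply to it. */
const char *escaped_open(void)
{
    return "?\?<";
}

/* Hypothetical helper: returns 1 if `s` is exactly the three
   characters '?', '?', '<', and 0 otherwise. */
int is_three_plain_chars(const char *s)
{
    return strlen(s) == 3 && s[0] == '?' && s[1] == '?' && s[2] == '<';
}
```

    Calling is_three_plain_chars(escaped_open()) yields 1, so variant (2) emits the trigraph spelling itself and leaves the replacement to whichever compiler later translates the generated code.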

    When '??<' etc. is written to a binary stream, it must map to a single byte, be written as that byte, be distinguishable from every other basic character, and compare equal to '??<' when read back.


    Related: C89 rationale about trigraphs.