I'm learning about digraphs and trigraphs, and here is some code that I cannot understand. (Yes, I admit that it's extremely ugly.)
This code can compile:
#define _(s) s%:%:s
main(_(_))
<%
__;
%>
This code can compile, too:
#define _(s) s??=??=s
main(_(_))
<%
__;
%>
However, neither of the following two pieces of code can compile:
#define _(s) s%:??=s
main(_(_))
<%
__;
%>
And
#define _(s) s??=%:s
main(_(_))
<%
__;
%>
This confuses me: since the first two pieces of code compile, I suppose that digraphs and trigraphs are both expanded before macro expansion. So why does the code fail to compile when a digraph and a trigraph are used together?
Digraphs and trigraphs are totally different. Trigraphs are replaced during phase 1 of translation, [see Note 1] which is before the source code has been separated into tokens. Digraphs are tokens which are alternate spellings for other tokens, so they are not meaningful until after the source has been separated into tokens. (The word "digraph" is not very accurate; it is used because it resembles "trigraph", but the set of digraphs includes %:%:, which consists of four characters.)
So ??= is replaced with a # before any token analysis is done. But %: is just a token, with the same meaning as #.
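To make the ordering concrete, here is a minimal sketch of what the phase-1 replacement amounts to for ??= alone. The helper name replace_hash_trigraph is made up for illustration, and a real compiler handles all nine trigraphs, not just this one. The point is only that the substitution is purely textual and knows nothing about tokens, so it leaves %: exactly as it found it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only: copies src into dst and
   replaces every occurrence of the hash trigraph with '#'.
   dst must have room for strlen(src) + 1 bytes. */
static void replace_hash_trigraph(const char *src, char *dst)
{
    /* The second '?' is escaped so that this literal is not itself
       turned into "#" when the file is compiled with trigraphs enabled. */
    static const char tri[] = "?\?=";
    const size_t n = sizeof tri - 1;

    while (*src != '\0') {
        if (strncmp(src, tri, n) == 0) {
            *dst++ = '#';       /* textual replacement, no token analysis */
            src += n;
        } else {
            *dst++ = *src++;    /* everything else, including %:, is copied as-is */
        }
    }
    *dst = '\0';
}
```

Fed the macro bodies from the question, s??=??=s comes out as s##s, while s%:??=s comes out as s%:#s. The tokenizer only ever sees the second form of each.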
Moreover, %:%: is a token with the same meaning as ##. But %:# is two tokens (%: and #), which is not legal since the stringify operator (whether spelled %: or #) can only be followed by a macro parameter. [See Note 2] And it does not become any less illegal if the # is the result of a trigraph substitution. The same reasoning applies to #%: in your last example: the # is followed by the token %:, not by a parameter.
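For comparison, here is a small sketch that uses only digraphs, assuming a compiler running in a mode that accepts them. The names STR, CAT, stringified and pasted are invented for the example. Each digraph is a complete token, so %: followed by a parameter stringifies and %:%: pastes, exactly like # and ##.

```c
/* %: at the start of a directive is the digraph spelling of #. */
%:define STR(x) %:x          /* same as: #define STR(x) #x      */
%:define CAT(a, b) a%:%:b    /* same as: #define CAT(a, b) a##b */

static const char *stringified(void)
<%
    return STR(hello);       /* expands to the string literal "hello" */
%>

static int pasted(void)
<%
    return CAT(1, 2);        /* the tokens 1 and 2 are pasted into 12 */
%>
```

Replacing the second half of %:%: in CAT with a plain # would give the two tokens %: and #, which is the ill-formed case described above.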
One important difference between digraphs and trigraphs, as illustrated by the hilarious snippet in chqrlie's answer, is that trigraphs also work in strings. Digraphs allow you to write C code even if your keyboard lacks brackets and octothorpi, but they don't help you print those characters out.
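A minimal sketch of the digraph half of that difference (function names are made up for the example). Outside a string literal, <: and :> are tokens meaning [ and ]; inside one, they are just two ordinary characters each. The trigraph counterpart is deliberately left out, because whether ??= is replaced inside a literal depends on whether your compiler has trigraphs enabled; GCC, for instance, ignores them in its default GNU modes unless you pass -trigraphs.

```c
/* Outside a string literal, <: and :> are alternate spellings of [ and ]. */
static int first_element(void)
{
    int values<:3:> = { 10, 20, 30 };
    return values<:0:>;
}

/* Inside a string literal there are no punctuator tokens, so the
   characters '<' and ':' stay exactly as written. */
static const char *literal_text(void)
{
    return "<:";
}
```

Printing literal_text() therefore shows <: rather than [.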
Note 1: §5.1.1.2, Translation phases, paragraph 1:
The precedence among the syntax rules of translation is specified by the following phases.
1. Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.
Note 2: §6.10.3.2, The # operator, paragraph 1:
Each # preprocessing token in the replacement list for a function-like macro shall be followed by a parameter as the next preprocessing token in the replacement list.