C++17 introduced the __has_include preprocessor expression, which has the following two forms:
__has_include(header-name)
__has_include(header-name-tokens)
For the first form, header-name can be either <h-char-sequence> or "q-char-sequence", which corresponds to the typical header names we use. For the second form, header-name-tokens can also have two forms: either a string literal or <h-pp-tokens>.
The standard mentions that the second form is only considered when the first form of the __has_include expression does not match.
My question is: since header-name and header-name-tokens seem to have almost identical syntax, what kind of code would result in a mismatch for header-name but a match for header-name-tokens?
Here are some of my thoughts on this issue.
Let’s first consider the case where header-name-tokens is a string literal, which corresponds to the "q-char-sequence" form in header-name. One difference between them is that the characters in a string literal (enclosed in quotes) are s-chars, which can include escape sequences and universal character names. A q-char, on the other hand, cannot include such characters (more precisely, the presence of \ in a q-char-sequence is conditionally supported), though current implementations seem to support it and do not treat it as an escape sequence. I would therefore expect __has_include("\n") to match as a header-name rather than as header-name-tokens.
Furthermore, string literals can have prefixes and suffixes, whereas "q-char-sequence" cannot. However, both GCC and Clang reject code like __has_include(u"abc"). I could not find a specific standard section that makes this case ill-formed.
Now, let’s consider the case where header-name and header-name-tokens are enclosed in angle brackets. A straightforward case for header-name-tokens is that the angle brackets can contain multiple preprocessor tokens, such as in __has_include(<Name1 Name2>). However, I believe this situation would still need to match the header-name token, because Name1 Name2 forms a character sequence that satisfies the h-char-sequence requirement.
This was very recently reworded by CWG3015 (diff), but the effect is still the same.
For #include, the grammar is (https://wg21.link/cpp#nt:control-line):
# include pp-tokens new-line
First it tries to match pp-tokens against the form < h-char-sequence >, then against " q-char-sequence "; if either matches, the file is included normally. Otherwise, the tokens are processed as 'normal text' (so macros are expanded), and the result is then checked against those two forms (https://eel.is/c++draft/cpp.include).
If it didn't do it this way, then this would happen:
#define stdio stdint
#include <stdio.h> // Would try to include stdint.h
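The 'normal text' fallback is what lets an include name come from a macro. A minimal sketch, assuming a hosted implementation that provides <cstddef>; MY_HEADER and kCharSize are hypothetical names chosen only for illustration:

```cpp
// MY_HEADER is a hypothetical macro name used only for this sketch.
#define MY_HEADER <cstddef>

// The tokens after "include" do not start with < or ", so no header-name
// token is formed. MY_HEADER is macro-replaced first, and the resulting
// tokens < cstddef > are then matched against the angle-bracket form.
#include MY_HEADER

// std::size_t is declared by <cstddef>, so this line compiles only if the
// computed include above really pulled that header in.
constexpr std::size_t kCharSize = sizeof(char);
static_assert(kCharSize == 1, "sizeof(char) is 1 by definition");
```

How the tokens between < and > are combined back into a single header name is implementation-defined, which is one more reason the direct form is matched first.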
__has_include has to do the same thing: if its operand matches < h-char-sequence > or " q-char-sequence " (a header-name), no expansion is done. If it doesn't, it's 'normal text' (any tokens), so the tokens need to be expanded. Unlike # include, which can just take all lexed tokens up to a new-line, it needs a specific grammar to match until the ).
Previously, from __has_include's inclusion in C++17 (P0061R1), it was explicitly spelled out:
The third and fourth forms of has-include-expression are considered only if neither of the first or second forms matches, in which case the preprocessing tokens are processed just as in normal text.
(Where the first and second form were later combined into header-name and the third and fourth to header-name-tokens).
Now this is enforced by the lexer (https://wg21.link/lex.pptoken#4.3):
- Otherwise, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail, except that
  - a header-name is only formed
    - immediately after the include, embed, or import preprocessing token in a #include, #embed, or import directive, respectively, or
    - immediately after a preprocessing token sequence of __has_include or __has_embed immediately followed by ( in a #if, #elif, or #embed directive and
  - a string-literal token is never formed when a header-name token can be formed.
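The context dependence of that rule can be sketched as follows: the same characters <cstddef> lex as three tokens in ordinary code but as one header-name right after __has_include(. The names Tag, kFromTokens and kHeaderNameFormed are hypothetical, and the sketch assumes a C++17 implementation that provides <cstddef>:

```cpp
// Tag is a hypothetical helper that exposes its template argument.
template <int N>
struct Tag {
    static constexpr int value = N;
};

// An ordinary variable that happens to share its name with a header.
constexpr int cstddef = 7;

// In ordinary code no header-name is formed: `<cstddef>` is the three
// tokens `<`, `cstddef`, `>`, i.e. a template argument list.
constexpr int kFromTokens = Tag<cstddef>::value;

// Immediately after `__has_include(` in a #if directive, the same
// characters form a single header-name token naming the header.
#if __has_include(<cstddef>)
constexpr bool kHeaderNameFormed = true;
#else
constexpr bool kHeaderNameFormed = false;
#endif
```

Because the header-name is a single token, nothing inside it is an identifier, so a macro named cstddef would not be expanded there either.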
So to answer your question directly:
The second form is used when the operand is not literally the character < or " followed by the appropriate header-name characters and then the matching > or ". In that case the lexer does not produce a header-name token, and the operand must be parsed as something else.
For example:
#define HEADER <stdint.h>
#define h nonexistent_file_extension
#if __has_include(HEADER) // Second form, expands h, -> 0
#error
#endif
#if !__has_include(<stdint.h>) // First form, does not expand h, -> 1
#error
#endif
#undef h
#if !__has_include(HEADER) // No longer expands h
#error
#endif
#if !__has_include(<stdint.h>) // Unaffected
#error
#endif
And the reason __has_include(u"abc") is ill-formed is that, after macro expansion, u"abc" doesn't match " q-char-sequence ".
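By contrast, an unprefixed string literal produced by macro replacement does match the quoted form. A minimal sketch of the string-literal alternative of the second form, assuming a C++17 implementation where the quoted search falls back to the angle-bracket search as it does for #include; HDR_NAME, STRINGIZE and the k* constants are hypothetical names:

```cpp
// HDR_NAME and STRINGIZE are hypothetical macros used only for this sketch.
#define HDR_NAME "cstddef"
#define STRINGIZE(x) #x

// First form: the characters "cstddef" directly after ( form a header-name.
#if __has_include("cstddef")
constexpr bool kDirect = true;
#else
constexpr bool kDirect = false;
#endif

// Second form: HDR_NAME is an identifier, so no header-name is formed.
// Macro replacement then yields an ordinary, unprefixed string literal.
#if __has_include(HDR_NAME)
constexpr bool kViaMacro = true;
#else
constexpr bool kViaMacro = false;
#endif

// Second form again: here the # operator produces the string literal.
#if __has_include(STRINGIZE(cstddef))
constexpr bool kViaStringize = true;
#else
constexpr bool kViaStringize = false;
#endif

// All three spellings name the same header, so they should agree.
static_assert(kDirect == kViaMacro && kViaMacro == kViaStringize,
              "direct and macro-produced forms should agree");
```

I would expect all three constants to be true with GCC and Clang, but the exact lookup of a quoted name is implementation-defined, so the sketch only checks that the forms agree with each other.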