c++ language-lawyer c-preprocessor preprocessor-directive

In which situations does the second form of `__has_include` appear?

Problem Description

C++17 introduced the __has_include preprocessor expression, which has the following two forms:

__has_include(header-name)
__has_include(header-name-tokens)

For the first form, header-name can be either <h-char-sequence> or "q-char-sequence", which corresponds to the typical header names we use. For the second form, header-name-tokens can also have two forms: either a string literal or <h-pp-tokens>.
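
For reference, the first form in both of its spellings might look like the sketch below. It assumes a C++17 compiler; the constants `kHasCstddef` and `kHasDemoHeader` and the file name `no_such_header_for_demo.h` are made up for illustration, and that file is assumed not to exist on the include path.

```cpp
// First form, angle-bracket spelling: <cstddef> is a single header-name token.
#if __has_include(<cstddef>)
#  include <cstddef>
constexpr bool kHasCstddef = true;
#else
constexpr bool kHasCstddef = false;
#endif

// First form, quoted spelling: "no_such_header_for_demo.h" is a single
// header-name token. The file name is hypothetical and assumed to be absent.
#if __has_include("no_such_header_for_demo.h")
constexpr bool kHasDemoHeader = true;
#else
constexpr bool kHasDemoHeader = false;
#endif
```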

The standard mentions that the second form is only considered when the first form of the __has_include expression does not match.

My question is: Since header-name and header-name-tokens seem to have almost identical syntax, what kind of code would result in a mismatch for header-name but a match for header-name-tokens?

Some Discussion on the Problem

Here are some of my thoughts on this issue.

Let’s first consider the case where header-name-tokens is a string literal, which corresponds to the "q-char-sequence" form of header-name. One difference is that the characters of a string literal (enclosed in quotes) are s-chars, which can include escape sequences and universal character names. A q-char, on the other hand, cannot include such characters (more precisely, the presence of \ in a q-char-sequence is conditionally supported), though current implementations seem to accept it and do not treat it as an escape sequence. I would therefore expect "\n" in __has_include("\n") to match header-name rather than header-name-tokens.

Furthermore, string literals can have prefixes and suffixes, whereas "q-char-sequence" cannot. However, both GCC and Clang reject code like __has_include(u"abc"). I could not find a specific standard section that makes this case illegal.

Now, let’s consider the case where header-name and header-name-tokens are enclosed in angle brackets. A straightforward case for header-name-tokens is that the angle brackets can contain multiple preprocessor tokens, such as in __has_include(<Name1 Name2>). However, I believe this situation would still need to match the header-name token, because Name1 Name2 forms a character sequence that satisfies the h-char-sequence requirement.

Solution

This was very recently reworded by CWG3015 (diff), but the effect is still the same.

For #include, the grammar is (https://wg21.link/cpp#nt:control-line):

# include pp-tokens new-line

The preprocessor first tries to match pp-tokens to the form < h-char-sequence >, then to " q-char-sequence "; if either matches, that file is included normally. Otherwise the tokens are treated as 'normal text': they are macro-replaced, and the result is then checked against those two forms (https://eel.is/c++draft/cpp.include).

If it didn't do it this way, then this would happen:

#define stdio stdint
#include <stdio.h>  // Would try to include stdint.h
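
The fallback path is also what makes a macro-computed include work. A minimal sketch, assuming a C++17 compiler; `CSTDINT_HEADER` and `kValue` are names invented for this illustration:

```cpp
// CSTDINT_HEADER is a hypothetical macro name used only for this sketch.
#define CSTDINT_HEADER <cstdint>

// The tokens after `include` do not start with < or ", so they match neither
// direct form. They are macro-replaced as normal text, and the result
// <cstdint> is then matched against the < h-char-sequence > form.
#include CSTDINT_HEADER

// std::uintmax_t is usable only if the include above resolved to <cstdint>.
constexpr std::uintmax_t kValue = 42;
```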

__has_include has to do the same thing: if its operand matches < h-char-sequence > or " q-char-sequence " (a header-name), no macro expansion is done. If it doesn't, the operand is 'normal text' (any tokens), which must be macro-expanded. Unlike # include, which can simply take all lexed tokens up to the new-line, __has_include needs a specific grammar to match tokens up to the closing ).

Previously, from __has_include's inclusion in C++17 (P0061R1), it was explicitly spelled out:

The third and fourth forms of has-include-expression are considered only if neither of the first or second forms matches, in which case the preprocessing tokens are processed just as in normal text.

(The first and second forms were later combined into header-name, and the third and fourth into header-name-tokens.)

Now this is enforced by the lexer (https://wg21.link/lex.pptoken#4.3):

Otherwise, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail, except that

a header-name is only formed

immediately after the include, embed, or import preprocessing token in a #include, #embed, or import directive, respectively, or

immediately after a preprocessing token sequence of __has_include or __has_embed immediately followed by ( in a #if, #elif, or #embed directive and

a string-literal token is never formed when a header-name token can be formed.
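
One way to observe this rule is to stringify the same characters outside any of those contexts. The sketch below is only an illustration: `DEMO_STR`, `DEMO_STR_IMPL`, `demo_header` and `kLexed` are hypothetical names, and the exact spacing of the resulting string may differ between preprocessors.

```cpp
#define DEMO_STR_IMPL(x) #x
#define DEMO_STR(x) DEMO_STR_IMPL(x)  // macro-expands the argument, then stringifies it

// Hypothetical macro: only takes effect where demo_header is an identifier token.
#define demo_header X

// This is not directly after `include` or `__has_include(`, so no header-name
// is formed: <demo_header> is lexed as the three tokens <, demo_header and >,
// and the identifier in the middle is macro-replaced before stringification.
constexpr char kLexed[] = DEMO_STR(<demo_header>);

// The string is shorter than "<demo_header>" because demo_header became X.
static_assert(sizeof(kLexed) < sizeof("<demo_header>"),
              "demo_header was expanded, so it was an ordinary identifier token");
```

Had the same characters appeared directly inside `__has_include(`, they would have formed a single header-name token and `demo_header` would not have been expanded.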

So to answer your question directly:

The second form is used when the characters directly after __has_include( are not literally < or " followed by the appropriate header characters and then the matching > or ". In that case the lexer does not produce a header-name token, and the operand must be parsed as something else.

For example:

#define HEADER <stdint.h>
#define h non_existant_file_extension

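// Second form: HEADER expands to <stdint.h>, and h is then expanded as well,
// so <stdint.h> is no longer what gets looked up -> 0
#if __has_include(HEADER)
#error
#endif
// First form: <stdint.h> is a header-name token, so h is not expanded -> 1
#if !__has_include(<stdint.h>)
#error
#endif

#undef h
// Still the second form, but h is no longer a macro -> 1
#if !__has_include(HEADER)
#error
#endif
// First form, unaffected by the #undef -> 1
#if !__has_include(<stdint.h>)
#error
#endif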

And the reason __has_include(u"abc") is ill-formed is that, after macro expansion, u"abc" doesn't match " q-char-sequence ".
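
The string-literal spelling of the second form can be sketched in the same way. Here `QUOTED_HEADER`, `kDirect` and `kViaMacro` are hypothetical names, and a C++17 compiler is assumed; the prefixed variant is left commented out because it would not compile.

```cpp
// QUOTED_HEADER is a hypothetical macro name used only for this sketch.
#define QUOTED_HEADER "cstdint"

// First form: the characters "cstdint" directly follow __has_include( and
// are lexed as a single header-name token.
#if __has_include("cstdint")
constexpr bool kDirect = true;
#else
constexpr bool kDirect = false;
#endif

// Second form: QUOTED_HEADER is an identifier, so no header-name is formed.
// After macro replacement, the string-literal "cstdint" matches
// " q-char-sequence ".
#if __has_include(QUOTED_HEADER)
constexpr bool kViaMacro = true;
#else
constexpr bool kViaMacro = false;
#endif

// Both spellings name the same header, so the results are expected to agree.
static_assert(kDirect == kViaMacro, "both forms should find the same header");

// Rejected by GCC and Clang: u"cstdint" does not match " q-char-sequence ".
// #if __has_include(u"cstdint")
// #endif
```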