I know I can write a non-ASCII character literal using a Unicode escape sequence like:
wchar_t myChar = L'\u00C6';
But is there any guarantee that the resulting numerical value of myChar is actually hexadecimal C6? Or does the C language specification leave this as an implementation-defined detail?
Section 6.10.8 of this (apparent?) draft spec seems to imply that such a guarantee exists only if the optional __STDC_ISO_10646__ macro is defined (I guess either explicitly or as a compiler default). But I'm not 100% sure of my understanding, or of how official that doc is (the truly official spec seems hidden behind a paywall). So I'm wondering whether anyone knows for sure.
Update:
To clarify, this question has nothing to do with the issue of Unicode characters that don't fit in 16 bits. It has to do with the relationship between a character's "short identifier" (the hexadecimal code shown on unicode.org charts and used in the escape code) and the corresponding numerical value of the wchar_t variable. That is, whether this code:
wchar_t myChar = L'\u00C6';
printf("%04X", (unsigned int)myChar);
could result in output such as:
007B
The value of 007B is arbitrary; the point is just that it is something other than 00C6. I'm not aware of anything in the language specification that requires the numerical value of the wchar_t to equal the "short identifier" (as a concrete hypothetical example, imagine a C language implementation which maps each character to a wchar_t whose numerical value is the 2's complement of the "short identifier").
Regarding the __STDC_ISO_10646__ macro, I think your reading of the standard is correct. Quoting the N1570 draft of the C11 standard:
__STDC_ISO_10646__
An integer constant of the form yyyymmL (for example, 199712L). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character. The Unicode required set consists of all the characters that are defined by ISO/IEC 10646, along with all amendments and technical corrigenda, as of the specified year and month. If some other encoding is used, the macro shall not be defined and the actual encoding used is implementation-defined.
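If the macro is defined, the integer value of the wchar_t object will equal the hex value of the character's short identifier. If it is not defined, the standard makes no such promise and the encoding is implementation-defined. Note that the guarantee covers only characters in the "required set" (those defined by ISO/IEC 10646 as of the date the macro expands to), not arbitrary hex values.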