
Does the C language spec guarantee a mapping of Unicode code points to numerical wchar_t values?


I know I can write a non-ASCII character literal using a Unicode escape sequence like:

wchar_t myChar = L'\u00C6';

But, is there any guarantee that the resulting numerical value of myChar is actually hexadecimal C6? Or, does the C language specification leave this as an implementation-defined detail?

Section 6.10.8 of this (apparent?) draft spec seems to imply that such a guarantee exists only if the optional __STDC_ISO_10646__ macro is defined (I guess either explicitly or as a compiler default). But, I'm not 100% sure of my understanding, or of how official that doc is (the truly official spec seems hidden behind a paywall). So, I'm wondering whether anyone knows for sure.


Update:

To clarify, this question has nothing to do with the issue of Unicode characters that don't fit in 16 bits. It has to do with the relationship between a character's "short identifier" (the hexadecimal code shown on unicode.org charts and used in the escape code) versus the corresponding numerical value of the wchar_t variable. That is, whether this code:

wchar_t myChar = L'\u00C6';
printf("%04X", (unsigned)myChar);  /* cast: %X expects an unsigned int */

could result in output such as:

007B

The value 007B is arbitrary; the point is only that it is something other than 00C6. I'm not aware of anything in the language specification that requires the numerical value of the wchar_t to equal the "short identifier". As a concrete hypothetical example, imagine a C implementation that maps each character to a wchar_t whose numerical value is the two's complement of the "short identifier".


Solution

  • Regarding the __STDC_ISO_10646__ macro, I think your reading of the standard is correct. Quoting the N1570 draft of the C11 standard:

    __STDC_ISO_10646__ An integer constant of the form yyyymmL (for example, 199712L). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character. The Unicode required set consists of all the characters that are defined by ISO/IEC 10646, along with all amendments and technical corrigenda, as of the specified year and month. If some other encoding is used, the macro shall not be defined and the actual encoding used is implementation-defined.

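    So if the macro is defined, the integer value of the wchar_t object equals the value of the character's short identifier, and L'\u00C6' must have the value 0xC6. If the macro is not defined, the encoding is implementation-defined and the standard makes no such promise. Note that the guarantee covers only characters in the "required set", that is, characters defined by ISO/IEC 10646 as of the year and month the macro expands to; it says nothing about arbitrary hex values or characters added to the standard after that date.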
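A minimal sketch of how a program could check this for itself, assuming a C99 or later compiler. The helper names wchar_is_iso10646 and report_ae are made up for this illustration and are not part of any standard library. The first reports whether the implementation defines __STDC_ISO_10646__; the second prints the numeric value of L'\u00C6' and asserts that it is 0xC6 only when the macro makes that a guarantee.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if the implementation defines
   __STDC_ISO_10646__, i.e. promises that wchar_t values equal the
   short identifiers of characters in the required set; 0 otherwise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: prints the numeric value of L'\u00C6' and
   returns it. The equality with 0xC6 is asserted only when the macro
   is defined, because otherwise the encoding is implementation-defined. */
unsigned long report_ae(void)
{
    wchar_t ae = L'\u00C6';
    /* Convert explicitly: the underlying type of wchar_t varies
       between implementations, so it is not passed to printf as-is. */
    unsigned long value = (unsigned long)ae;

    printf("L'\\u00C6' has value %04lX\n", value);
    if (wchar_is_iso10646())
        assert(value == 0xC6UL);
    return value;
}
```

Calling report_ae() from your own main prints the value your implementation actually uses. On an implementation that defines the macro, the standard requires that value to be C6; on one that does not, any value is conforming as long as it is documented.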