
Actual usages for C multibyte character constants


Can anybody help me understand the actual usages of multibyte character constants in C?

I have seen the following code work just fine, and I want to understand what this language feature is actually used for. (I know that defining them is standard C; accessing them, however, is not standard conformant.) Someone pointed out to me that these multibyte character constants are useful on platforms like Classic MacOS, but they could not provide an example.

#include <stdio.h>

int main() {
    (void) 'this'; // this seems to be standard conformant

    // but what can we do with this "feature"?
    // This compiles and runs just fine, but is a crude hack:
    long i = 'this';
    const char* u = (const char*) &i;
    const unsigned z = sizeof(long)/sizeof(char);
    printf("%u\n", z);

    for(unsigned v = 0; v < z; v++)
    {
        printf("%c\n", (char)u[v]);
    }

    return 0;
}

Code output was requested (see here: https://godbolt.org/z/ebsTj4a9E):

8
s
i
h
t

Solution

  • Can anybody help me understand the actual usages of multibyte character constants in C?

    Your wording is a bit of a melange as far as the language spec's terminology goes, but then the spec uses a confusing set of similar terms for similar but distinct concepts. Among them

    That's a bit of a mess, I think you'll agree, but to its credit, at least the spec avoids throwing the term "multibyte character constant" into the mix as well.

    I think what you're talking about is what the spec describes as "an integer character constant containing more than one character [...] or containing a character or escape sequence that does not map to a single-byte execution character". The values of such constants have type int, as do all character constants, but their values are implementation defined.

    "I know that defining them is standard C; accessing them, however, is not standard conformant"

    No, that's not a good description. Character constants containing multiple single-byte characters are lexically valid, and, supposing that the implementation accepts them, their values are implementation defined. However, the spec does not actually oblige implementations to accept such constants, either in general or in any particular case. That's a bit of a problem for "defining them is standard C". On the other hand, in implementations that do accept them, they serve as ordinary constants with whatever int values the implementation attributes to them. In that sense, there is no inherent issue with accessing them.

    The main issue with these is that they are not portable, in the sense that different implementations may attribute different numeric values to the same lexical constant. In truth, however, this is a difference of degree, not kind, for exactly the same is true of character constants formed of individual single-byte characters.

    The thing that distinguishes character constants formed of individual single-byte characters is exactly that they do map to individual members of the execution character set, in a predictable way. If you need a portable program then you need to avoid character constants of the kind you ask about. However, if you are content with code that you can rely upon to work correctly only on certain implementations, then "implementation defined" means that the values of such constants are defined, and conforming implementations must each document their definitions. For example, since at least version 4.0, GCC has used this definition:

    The compiler evaluates a multi-character character constant a character at a time, shifting the previous value left by the number of bits per target character, and then or-ing in the bit-pattern of the new character truncated to the width of a target character. The final bit-pattern is given type int, and is therefore signed [...]. If there are more characters in the constant than would fit in the target int [...] the excess leading characters are ignored.

    (GCC Manual)

    You can rely on that as long as you stick to GCC and any other implementation that guarantees compatibility with GCC in this area. And you might not even need specific values to match across implementations, as long as different constants of interest to you (not exceeding some maximum length, say) can be relied upon to have different values.

    But what can you actually do with them?

    There's not much that I would actually do with them, myself, but I can imagine them being used