objective-c utf-8 char nsstring nsstringencoding

Why do I get different UTF-8 representations of an NSString depending on string construction or when running in different environments?


I have some very simple Objective-C code that allocates and initialises an NSString and then gets the UTF-8 const char * representation of that string as follows:

const char *s = [[[NSString alloc] initWithFormat:@"%s", "£"] UTF8String];

I then print out the hex values of the code units that make up this string using this code:

while(*s)
    printf("%02x ", (unsigned int) *s++);

and I get the following output:

ffffffc2 ffffffac ffffffc2 ffffffa3 

This is unexpected: I'd expect to get just ffffffc2 ffffffa3, since the UTF-8 encoding of the £ character is two code units, c2 followed by a3, as you can see here.

Here's a screenshot of this output in the simplest iOS app imaginable running locally on my laptop:

[Image: Xcode window showing hex output of UTF8 string]

Note that the output is the same if I create the NSString as follows:

[[NSString alloc] initWithFormat:@"%s", "\xc2\xa3"]

If I instead use an NSString as the argument to be interpolated into the format string then I get the expected output of ffffffc2 ffffffa3:

[[NSString alloc] initWithFormat:@"%@", @"£"]

What's even stranger to me is that exactly the same failing code (the first version above) works as I'd expect on an online Objective-C codepen-type site I found, which you can see here.

Why are the extra code units being added to the UTF-8 representation of the string when I use the initWithFormat:@"%s" version of the code, and seemingly only when I run it on my machine?


Solution

  • The C language does not specify the encoding of strings; rather, it specifies a set of characters that must be included in the source character set, and that each of those characters fits in a single byte.

    When compiling (Objective-)C, the Apple Clang compiler appears to follow this: the encoding of the characters in a C string is based on the encoding of the source file. The default encoding for source files is UTF-8, so the C string literal "£" is stored as the bytes c2, a3, 00, which are the UTF-8 encoding of "£" followed by a null byte.

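    As @Wileke remarked, the %s format specifier interprets its argument according to the system default encoding (documentation). This default encoding appears to be Mac OS Roman. In that encoding the byte c2 is the character "¬" and the byte a3 is the character "£", so the string you produce from initWithFormat: contains those two characters. Converting them back to UTF-8 gives c2 ac for "¬" and c2 a3 for "£", which is exactly the four-byte output you see. An environment with a different system default encoding would presumably give a different result, which may be why the online site behaves as you expected.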
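The reinterpretation described above can be reproduced in plain C (which also compiles as Objective-C). This is only a sketch: the function names are made up for illustration, and the Mac OS Roman table is deliberately partial, covering just the two bytes from the question. It is not how NSString is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: maps a Mac OS Roman byte to a Unicode code point.
   Only the two high bytes from this question are covered. */
unsigned int mac_roman_to_code_point(unsigned char byte)
{
    if (byte < 0x80)
        return byte;              /* the ASCII range is shared */
    switch (byte) {
    case 0xC2: return 0x00AC;     /* "¬" */
    case 0xA3: return 0x00A3;     /* "£" */
    default:   return 0xFFFD;     /* replacement character: not covered by this sketch */
    }
}

/* Encodes a code point below U+10000 as UTF-8 and returns the byte count. */
size_t encode_utf8(unsigned int cp, unsigned char *out)
{
    assert(cp < 0x10000);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Reads `in` as Mac OS Roman and writes the UTF-8 re-encoding to `out`,
   which must have room for 3 bytes per input byte. Returns bytes written. */
size_t mac_roman_bytes_to_utf8(const char *in, unsigned char *out)
{
    size_t n = 0;
    for (; *in != '\0'; in++)
        n += encode_utf8(mac_roman_to_code_point((unsigned char)*in), out + n);
    return n;
}
```

Calling mac_roman_bytes_to_utf8("\xc2\xa3", out) writes the four bytes c2 ac c2 a3, matching the output in the question.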

    As you have already suggested in your comments, you can address your problem by using initWithUTF8String:, which will work provided your source file encoding is UTF-8. If your source file uses a different encoding, you should instead use initWithCString:encoding: and specify the encoding of your source file.

    If you are unsure of your source file's encoding, select the file in Xcode and look at the File inspector pane; there you can see and change the encoding (either reinterpreting or converting the existing bytes).

    Note: If, in your real code, the C string is not formed from a string literal in the same file, you will have to determine the encoding of that string.
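A side note on the ffffff prefix in the printed values: it has nothing to do with the string encoding. Plain char is signed on Apple platforms, so a byte such as c2 is a negative value, and casting it straight to unsigned int sign-extends it to ffffffc2. A minimal sketch of a dump that avoids this by going through unsigned char first (hex_dump is a hypothetical name, not an existing API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Writes each byte of `s` into `buf` as two hex digits and a space.
   Stops early if the next byte would not fit. Returns characters written. */
size_t hex_dump(const char *s, char *buf, size_t buflen)
{
    size_t used = 0;

    assert(s != NULL && buf != NULL);
    if (buflen == 0)
        return 0;
    buf[0] = '\0';

    /* Each byte needs 3 characters plus the terminating NUL. */
    for (; *s != '\0' && used + 4 <= buflen; s++) {
        /* Converting to unsigned char first prevents sign extension. */
        used += (size_t)snprintf(buf + used, buflen - used, "%02x ",
                                 (unsigned int)(unsigned char)*s);
    }
    return used;
}
```

With this, hex_dump("\xc2\xa3", buf, sizeof buf) leaves "c2 a3 " in buf rather than "ffffffc2 ffffffa3 ".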

    HTH