Displaying wide chars with printf


I'm trying to understand how printf works with wide characters (wchar_t).

I've written the following code samples:

Sample 1:

#include <stdio.h>
#include <stdlib.h>

int     main(void)
{
    wchar_t     *s;

    s = (wchar_t *)malloc(sizeof(wchar_t) * 2);
    s[0] = 42;
    s[1] = 0;
    printf("%ls\n", s);
    free(s);
    return (0);
}

Output:

*

Everything is fine here: my character (*) is correctly displayed.

Sample 2:

I wanted to display another kind of character. On my system, wchar_t seems to be encoded on 4 bytes, so I tried to display the following character: É

#include <stdio.h>
#include <stdlib.h>

int     main(void)
{
    wchar_t     *s;

    s = (wchar_t *)malloc(sizeof(wchar_t) * 2);
    s[0] = 0xC389;
    s[1] = 0;
    printf("%ls\n", s);
    free(s);
    return (0);
}

But this time there is no output. I tried many values from the "encoding" section (cf. previous link) for s[0] (0xC389, 201, 0xC9...), but the É character is never displayed. I also tried %S instead of %ls.

If I call printf like this: printf("<%ls>\n", s), the only character printed is '<'; the output is truncated.

Why do I have this problem? What should I do?


Solution

  • Why do I have this problem?

    Make sure you check errno and the return value of printf!

    #include <stdio.h>
    #include <stdlib.h>
    #include <wchar.h>
    
    int main(void)
    {
        wchar_t *s;
        s = (wchar_t *) malloc(sizeof(wchar_t) * 2);
        s[0] = 0xC389;
        s[1] = 0;
    
        if (printf("%ls\n", s) < 0) {
            perror("printf");
        }
    
        free(s);
        return (0);
    }
    

    See the output:

    $ gcc test.c && ./a.out
    printf: Invalid or incomplete multibyte or wide character
    

    How to fix

    First of all, the default locale of a C program is C (also known as POSIX), which is only guaranteed to handle ASCII. You need to add a call to setlocale, specifically setlocale(LC_ALL, ""), so that the program picks up the locale from the environment.

    If your LC_ALL, LC_CTYPE or LANG environment variables do not select a UTF-8 locale, you'll have to name one explicitly. setlocale(LC_ALL, "C.UTF-8") works on many systems, but C.UTF-8 is not required by the C standard and is not available everywhere, so check the return value of setlocale (it returns NULL when the locale cannot be set).

    #include <stdio.h>
    #include <stdlib.h>
    #include <locale.h>
    #include <wchar.h>
    
    int main(void)
    {
        wchar_t *s;
        s = (wchar_t *) malloc(sizeof(wchar_t) * 2);
        s[0] = 0xC389;
        s[1] = 0;
    
        setlocale(LC_ALL, "");
    
        if (printf("%ls\n", s) < 0) {
            perror("printf");
        }
    
        free(s);
        return (0);
    }
    

    See the output:

    $ gcc test.c && ./a.out
    쎉
    

    The wrong character is printed because wchar_t holds a wide character (a Unicode code point in glibc, i.e. UTF-32), not a multibyte sequence (such as UTF-8). 0xC389 is the UTF-8 byte sequence for É (0xC3 0x89), but stored in a wchar_t it is read as the code point U+C389, which is the Hangul syllable 쎉. Note that wchar_t is 32 bits wide in the GNU C Library, but the C standard doesn't require it to be. If you initialize the character with the code point of É (U+00C9, i.e. 0xC9), then it prints out correctly:

    #include <stdio.h>
    #include <stdlib.h>
    #include <locale.h>
    #include <wchar.h>
    
    int main(void)
    {
        wchar_t *s;
        s = (wchar_t *) malloc(sizeof(wchar_t) * 2);
        s[0] = 0xC9;
        s[1] = 0;
    
        setlocale(LC_ALL, "");
    
        if (printf("%ls\n", s) < 0) {
            perror("printf");
        }
    
        free(s);
        return (0);
    }
    

    Output:

    $ gcc test.c && ./a.out
    É
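
To make the relationship between the two values concrete, here is a minimal sketch of how the UTF-8 byte pair 0xC3 0x89 maps to the code point 0xC9. The helper name decode_utf8_2byte is hypothetical (it is not a library function) and it only handles two-byte sequences; in a real program, mbstowcs or mbrtowc would do this conversion for the current locale.

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only: decodes a two-byte UTF-8
 * sequence into its Unicode code point.
 * Returns -1 if the bytes are not a valid two-byte sequence. */
static int32_t decode_utf8_2byte(unsigned char lead, unsigned char cont)
{
    /* The lead byte must look like 110xxxxx, the continuation like 10xxxxxx. */
    if ((lead & 0xE0) != 0xC0 || (cont & 0xC0) != 0x80) {
        return -1;
    }

    /* 5 payload bits come from the lead byte, 6 from the continuation byte. */
    int32_t cp = ((int32_t)(lead & 0x1F) << 6) | (int32_t)(cont & 0x3F);

    /* Reject overlong encodings: code points below 0x80 fit in one byte. */
    return cp >= 0x80 ? cp : -1;
}
```

Calling decode_utf8_2byte(0xC3, 0x89) yields 0xC9, which is the value that belongs in s[0]; 0xC389 is just the two UTF-8 bytes written side by side.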
    

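    Note that you can also set the locale environment variables (LC_ALL, LC_CTYPE, LANG) on the command line. The program still has to call setlocale(LC_ALL, "") to pick them up. Put the assignment on the same line as the command (or export it), otherwise the variable is not passed to the program:

    $ LC_ALL=C.UTF-8 ./a.out
    É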
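
If the environment cannot be relied on to provide a UTF-8 locale, the program can try a few candidates itself. The following is only a sketch under assumptions: select_utf8_locale is a hypothetical helper, the candidate names "C.UTF-8" and "en_US.UTF-8" may not exist on a given system, and nl_langinfo(CODESET) is POSIX rather than standard C (the codeset is assumed to be reported as "UTF-8", as glibc does).

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: tries each candidate locale in turn and returns the
 * name of the first one whose character set is UTF-8 ("" means the locale
 * taken from the environment). Returns NULL, with the locale reset to "C",
 * if no candidate worked. */
static const char *select_utf8_locale(void)
{
    /* Assumed candidate names; adjust for the target system. */
    static const char *const candidates[] = { "", "C.UTF-8", "en_US.UTF-8" };
    size_t i;

    for (i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        /* setlocale returns NULL when the requested locale is unavailable. */
        if (setlocale(LC_ALL, candidates[i]) != NULL &&
            strcmp(nl_langinfo(CODESET), "UTF-8") == 0) {
            return candidates[i];
        }
    }

    setlocale(LC_ALL, "C");
    return NULL;
}
```

A program would call select_utf8_locale() once at startup, before the first printf("%ls\n", s), and report an error if it returns NULL.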