c winapi unicode utf-8 utf-16

Converting UTF-16 to UTF-8 using WideCharToMultiByte in C on Windows


I am trying to convert a Windows wchar_t[] to a UTF-8 encoded char[] so that calls to WriteFile will produce UTF-8 encoded files. I have the following code:

#include <windows.h>
#include <fileapi.h>
#include <stringapiset.h>

int main() {
    HANDLE file = CreateFileW(L"test.txt", GENERIC_ALL, 0, NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    const wchar_t source[] = L"hello";
    char buffer[100];
    WideCharToMultiByte(CP_UTF8, 0, source, sizeof(source)/sizeof(source[0]), buffer, sizeof(buffer)/sizeof(buffer[0]), NULL, NULL);
    WriteFile(file, buffer, sizeof(buffer), NULL, NULL);
    return CloseHandle(file);
}

This produces a file containing: "hello" but also a large amount of garbage after it.

Something about this caused me to think the issue was more than just simply dumping the excess characters in buffer and that the conversion wasn't happening properly, so I changed the source text as follows:

const wchar_t source[] = L"привет";

And this time got the following garbage:

[screenshot of the output]

I then wondered whether it was getting confused because it was looking for a null terminator and not finding one, even though the lengths are specified. So I changed the source string again:

const wchar_t source[] = L"hello\n";

And got the following garbage:

[screenshot of the output]

I'm fairly new to the WinAPI and am not primarily a C developer, so I'm sure I'm missing something; I just don't know what else to try.

edit: Following the advice from RbMm removed the excess garbage, so English prints correctly. However, the Russian is still garbage, just shorter garbage. Contrary to zett42's comment, I am most definitely using a UTF-8 text editor.

[screenshot of the output]

UTF-8 doesn't need a BOM, but adding one anyway produces:

[screenshot of the output]

Well that's odd. I expected the same text with a slightly larger binary size. Instead there's nothing.

edit:

Since some are keen on sticking to the idea that I'm using WordPad, here's what WordPad looks like

[screenshot of WordPad]

I'm clearly not using WordPad. I'm using VS Code, although the garbage is identical whether opened in VS Code, Visual Studio, Notepad, or Notepad++.

edit:

Here's the hex dump of the output from Russian:

[screenshot of the hex dump]


Solution

  • Update 3: The hex output suggests that the source file was misinterpreted during compilation. The compiler decoded it as Windows code page 1252 instead of UTF-8, so the string already has the wrong encoding in the compiled program. The byte sequence stored in the output file is therefore C3 90 C2 BF C3 91 E2 82 AC C3 90 C2 B8 C3 90 C2 B2 C3 90 C2 B5 C3 91 E2 80 9A instead of the correct D0 BF D1 80 D0 B8 D0 B2 D0 B5 D1 82.
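The double encoding can be reproduced by hand. The following is a minimal portable sketch, not what the compiler literally runs: it treats each byte of the correct UTF-8 text as a Windows-1252 character and encodes that character as UTF-8 again. The names `cp1252_high` and `misread_as_cp1252` are made up for this illustration, and mapping the five unassigned 1252 bytes to C1 controls is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Unicode code points for Windows-1252 bytes 0x80..0x9F. The five bytes the
   code page leaves unassigned are mapped to the matching C1 control here
   (an assumption of this sketch). */
static const unsigned short cp1252_high[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

/* Hypothetical helper: reads every input byte as one Windows-1252 character
   and writes that character's UTF-8 encoding to out. Returns the number of
   bytes written, or 0 if out is too small. */
size_t misread_as_cp1252(const unsigned char *in, size_t in_len,
                         unsigned char *out, size_t out_cap)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < in_len; ++i) {
        unsigned cp = in[i];
        if (cp >= 0x80 && cp <= 0x9F) { cp = cp1252_high[cp - 0x80]; }
        if (cp < 0x80) {                       /* one-byte sequence */
            if (n + 1 > out_cap) { return 0; }
            out[n++] = (unsigned char) cp;
        } else if (cp < 0x800) {               /* two-byte sequence */
            if (n + 2 > out_cap) { return 0; }
            out[n++] = (unsigned char) (0xC0 | (cp >> 6));
            out[n++] = (unsigned char) (0x80 | (cp & 0x3F));
        } else {                               /* three-byte sequence */
            if (n + 3 > out_cap) { return 0; }
            out[n++] = (unsigned char) (0xE0 | (cp >> 12));
            out[n++] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (unsigned char) (0x80 | (cp & 0x3F));
        }
    }
    return n;
}
```

Feeding it the 12 correct bytes D0 BF D1 80 D0 B8 D0 B2 D0 B5 D1 82 yields 26 bytes that begin C3 90 C2 BF and end E2 80 9A, which is the pattern in the hex dump.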

    How to solve this problem depends on the toolchain. MSVC has the /utf-8 flag to set both the source and the execution charset. You might think this is redundant, since you've already saved your source file as UTF-8. It turns out WordPad isn't the only software that requires a BOM to detect UTF-8. The following excerpt from the documentation explains the reason for the whole encoding problem.

    By default, Visual Studio detects a byte-order mark to determine if the source file is in an encoded Unicode format, for example, UTF-16 or UTF-8. If no byte-order mark is found, it assumes the source file is encoded using the current user code page, unless you have specified a code page by using /utf-8 or the /source-charset option.

    In Visual Studio 17 you can also configure the charset by setting Character Set in Configuration Properties > General > Project Defaults. If you use CMake, you will likely not encounter this problem because it configures everything properly out of the box.
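If changing compiler settings is not an option, one possible workaround is to keep the literal pure ASCII by spelling the characters as universal character names, so the compiler's guess about the file encoding no longer affects it. This is a sketch rather than a recommendation from the documentation quoted above; the name `source_escaped` is made up, and the compile-time check assumes a C11 compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* The Russian word from the question, written with universal character
   names. The source file contains only ASCII, so it is decoded the same way
   under any ASCII-compatible source charset. */
const wchar_t source_escaped[] = L"\u043f\u0440\u0438\u0432\u0435\u0442";

/* Six letters plus the terminating null, checked at compile time (C11). */
static_assert(sizeof source_escaped / sizeof source_escaped[0] == 7,
              "expected six code units and a terminator");
```

The array can be passed to WideCharToMultiByte exactly like the original `source`. The trade-off is readability: the literal is no longer legible as text.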

    Update 2: Some editors may not be able to deduce that the content is UTF-8 from a short byte sequence like this, which will result in the garbled output you've seen. You could add the UTF-8 byte order mark (BOM) at the beginning of the file to help these editors, although it's not considered a best practice: it conflates metadata and content, breaks ASCII backward compatibility, and is unnecessary because UTF-8 can be properly detected without it. It's mostly legacy software like Microsoft's WordPad that needs the BOM to interpret the file as UTF-8.

    DWORD bom_written = 0;
    if (WriteFile(file, "\xef\xbb\xbf", 3, &bom_written, NULL) == 0) { goto error_1; }
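For reference, the check a BOM-dependent reader performs amounts to comparing the first three bytes of the file. A minimal sketch, with `has_utf8_bom` being a made-up name rather than any editor's actual code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the buffer begins with the UTF-8 BOM
   (bytes EF BB BF), otherwise 0. */
int has_utf8_bom(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    return len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
}
```

A file written without the BOM line above fails this check, so such a reader falls back to a legacy code page.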
    

    Update: Code with a bit of basic error handling:

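    #include <stdlib.h>   /* calloc, free */
    #include <windows.h>
    #include <fileapi.h>
    #include <stringapiset.h>
    
    int main(void) {
        int ret_val = -1;
        int required_size = 0;
        char *buffer = NULL;
        DWORD written = 0;
    
        const wchar_t source[] = L"привет";
    
        /* CREATE_ALWAYS truncates an existing file, so bytes left over from an
           earlier run cannot survive. Write access is all that is needed. */
        HANDLE file = CreateFileW(L"test.txt", GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    
        if (file == INVALID_HANDLE_VALUE) { goto error_0; }
    
        /* First call: query the size in bytes, including the terminating null. */
        required_size = WideCharToMultiByte(CP_UTF8, 0, source, -1, NULL, 0, NULL, NULL);
    
        if (required_size == 0) { goto error_1; }
    
        buffer = calloc((size_t) required_size, sizeof(char));
    
        if (buffer == NULL) { goto error_1; }
    
        /* Second call: perform the conversion into the buffer. */
        if (WideCharToMultiByte(CP_UTF8, 0, source, -1, buffer, required_size, NULL, NULL) == 0) { goto error_1; }
    
        /* Write everything except the terminating null. The byte-count pointer
           may only be NULL for overlapped I/O, so pass a real variable. */
        if (WriteFile(file, buffer, (DWORD) (required_size - 1), &written, NULL) == 0) { goto error_1; }
    
        ret_val = 0;
    
    error_1:
        free(buffer);   /* free(NULL) is a no-op */
        if (CloseHandle(file) == 0) { ret_val = -1; }
    
    error_0:
        return ret_val;
    }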
    

    Old: You can do the following, which will create the file just fine. The first call to WideCharToMultiByte is used to determine the number of bytes required to store the UTF-8 string. Because -1 is passed as the source length, that count includes the terminating null, which is why WriteFile is given required_size - 1. Make sure to save the source file as UTF-8, otherwise the source string will not be properly encoded in the source file.

    The following code is just a quick and dirty example and lacks rigorous error handling.

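    #include <stdlib.h>   /* calloc, free */
    #include <windows.h>
    #include <fileapi.h>
    #include <stringapiset.h>
    
    int main(void) {
        /* CREATE_ALWAYS truncates any existing file. */
        HANDLE file = CreateFileW(L"test.txt", GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        const wchar_t source[] = L"привет";
        DWORD written = 0;
    
        /* Passing -1 as the length includes the terminating null in the count. */
        int required_size = WideCharToMultiByte(CP_UTF8, 0, source, -1, NULL, 0, NULL, NULL);
    
        char *buffer = (char *) calloc((size_t) required_size, sizeof(char));
    
        WideCharToMultiByte(CP_UTF8, 0, source, -1, buffer, required_size, NULL, NULL);
        /* Write everything except the terminating null. */
        WriteFile(file, buffer, (DWORD) (required_size - 1), &written, NULL);
        free(buffer);
        /* CloseHandle returns nonzero on success; map that to exit code 0. */
        return CloseHandle(file) ? 0 : 1;
    }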
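
To make the size-query pattern above concrete without depending on Windows headers, here is a portable sketch of a converter with the same calling convention: a capacity of 0 means "tell me how many bytes I need". The function `utf16_to_utf8` is hypothetical and is not how WideCharToMultiByte is implemented; replacing lone surrogates with U+FFFD is a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a Windows API. Encodes src_len UTF-16 code units
   as UTF-8. With dst_cap == 0 nothing is written and the required size is
   returned; otherwise returns the bytes written, or 0 if dst is too small. */
size_t utf16_to_utf8(const uint16_t *src, size_t src_len,
                     unsigned char *dst, size_t dst_cap)
{
    size_t n = 0;
    assert(src != NULL || src_len == 0);
    assert(dst != NULL || dst_cap == 0);
    for (size_t i = 0; i < src_len; ++i) {
        uint32_t cp = src[i];
        unsigned char tmp[4];
        size_t k;
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < src_len &&
            src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
            /* Valid surrogate pair: combine into one code point. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t) (src[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  /* lone surrogate: substitute U+FFFD */
        }
        if (cp < 0x80) {
            tmp[0] = (unsigned char) cp;
            k = 1;
        } else if (cp < 0x800) {
            tmp[0] = (unsigned char) (0xC0 | (cp >> 6));
            tmp[1] = (unsigned char) (0x80 | (cp & 0x3F));
            k = 2;
        } else if (cp < 0x10000) {
            tmp[0] = (unsigned char) (0xE0 | (cp >> 12));
            tmp[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            tmp[2] = (unsigned char) (0x80 | (cp & 0x3F));
            k = 3;
        } else {
            tmp[0] = (unsigned char) (0xF0 | (cp >> 18));
            tmp[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
            tmp[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            tmp[3] = (unsigned char) (0x80 | (cp & 0x3F));
            k = 4;
        }
        if (dst_cap != 0) {
            if (n + k > dst_cap) { return 0; }
            for (size_t j = 0; j < k; ++j) { dst[n + j] = tmp[j]; }
        }
        n += k;  /* in sizing mode only the count advances */
    }
    return n;
}
```

Called on the seven code units of the Russian string including its terminator, the sizing pass reports 13 bytes: two per Cyrillic letter plus one for the null. That matches why the listings write required_size - 1 bytes to the file.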