Tags: c++, file, unicode, wofstream

Why does wide file-stream in C++ narrow written data by default?


Honestly, I just don't get the following design decision in the C++ Standard Library. When writing wide characters to a file, wofstream converts the wchar_t characters into char:

#include <fstream>
#include <string>

int main()
{
    using namespace std;

    wstring someString = L"Hello StackOverflow!";
    wofstream file(L"Test.txt");

    file << someString; // the output file will consist of ASCII characters!
}

I am aware that this has to do with the standard codecvt. There is a codecvt for UTF-8 in Boost, and there is a codecvt for UTF-16 by Martin York here on SO. The question is: why does the standard codecvt convert wide characters at all? Why not write the characters as they are?

Also, are we going to get real Unicode streams with C++0x, or am I missing something here?
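
For what it's worth, C++0x (now C++11) did add std::codecvt_utf8, though it was deprecated again in C++17. A minimal sketch of imbuing it so the stream writes UTF-8 directly, regardless of the current locale:

#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
    using namespace std;

    wstring someString = L"Hello StackOverflow!";
    wofstream file("Test.txt");

    // Swap in a codecvt facet that serializes wchar_t as UTF-8
    // instead of narrowing through the stream's current locale.
    file.imbue(locale(file.getloc(), new codecvt_utf8<wchar_t>));

    file << someString; // written as UTF-8 bytes
}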


Solution

  • The model used by C++ for charsets is inherited from C, and so dates back to at least 1989.

    Two main points:

      • IO is done in terms of char: on output, the stream converts each wchar_t into one or more char.
      • It is the locale, through its codecvt facet, that determines how wide characters are serialized.

    The default locale (named "C") is very minimal; here it handles only 7-bit ASCII as both narrow and wide character set. So to get anything else, you have to set the locale.

    Take this simple program,

    #include <locale>
    #include <fstream>
    #include <ostream>
    #include <iostream>
    
    int main()
    {
        wchar_t c = 0x00FF;                    // U+00FF, outside 7-bit ASCII
        std::locale::global(std::locale("")); // use the locale named by the environment
        std::wofstream os("test.dat");
        os << c << std::endl;
        if (!os) {
            std::cout << "Output failed\n";
        }
    }
    

    which uses the environment locale and writes the wide character with code 0x00FF to a file. If I ask it to use the "C" locale, I get

    $ env LC_ALL=C ./a.out
    Output failed
    

    The locale was unable to handle the wide character, and we are notified of the problem because the IO failed. If I instead ask for a UTF-8 locale, I get

    $ env LC_ALL=en_US.utf8 ./a.out
    $ od -t x1 test.dat
    0000000 c3 bf 0a
    0000003
    

    (od -t x1 just dumps the file in hex), which is exactly what I expect for a UTF-8 encoded file: 0x00FF is encoded as the two bytes c3 bf, followed by the 0a newline written by std::endl.
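
    If relying on the environment is undesirable, the stream can be imbued with an explicit locale instead. A minimal sketch, assuming a locale named "en_US.utf8" is installed on the system:

    #include <fstream>
    #include <iostream>
    #include <locale>
    #include <stdexcept>

    int main()
    {
        std::wofstream os("test.dat");
        try {
            // Locale names are platform-dependent; "en_US.utf8" is an
            // assumption here and must be installed on the system,
            // otherwise the std::locale constructor throws.
            os.imbue(std::locale("en_US.utf8"));
        } catch (std::runtime_error const& e) {
            std::cout << "Locale not available: " << e.what() << '\n';
            return 1;
        }
        os << wchar_t(0x00FF) << std::endl; // serialized as c3 bf 0a, as above
        if (!os) {
            std::cout << "Output failed\n";
        }
    }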