c++ macos unicode utf-8 boost-locale

Reading UTF-8 data with C++ on a Mac not working


Although my C++ experience is quite limited, I am trying to help a C++ programmer get his library working on a Mac. At the moment, the problem seems to be purely locale/encoding related.

To create a minimal working example, I tested the following code, which reads a line of UTF-8 characters into a wide string (wstring), then walks through the string and prints each character.

On a Linux box it works perfectly, printing each character on its own line. On a Mac, however, I get one byte per line instead of one character per line.

The code is:

#include <sstream>
#include <iostream> 
#include <string>
#include <boost/locale.hpp>

using namespace std;

int main() {
    std::ios_base::sync_with_stdio(false);
    boost::locale::generator gen;
    locale mylocale = gen("pt_PT.UTF-8");
    locale::global(mylocale);

    wstring userInput;
    getline(wcin, userInput);

    wcerr << "Size of string is " << userInput.length() << endl;

    for (int i = 0; i < userInput.length(); ++i) {
        wcerr << userInput.at(i) << endl;
    }
    return 0;
}

and my test string is a silly Portuguese sentence:

O coração é um órgão frágil.

I am trying Boost.Locale because somebody told me it was the way to get Unicode working correctly on a Mac, but I would be happy with a solution that uses only the C++ standard library.

EDIT:

The following code works on the Mac. It doesn't compile on my Linux box because of the codecvt include, but I can work around that with some preprocessor directives.

#include <sstream>
#include <iostream> 
#include <fstream>
#include <codecvt>
#include <locale>
#include <string>

using namespace std;

int main() {
    // Setting std::locale::global does not seem to work (??)

    wcin.imbue(std::locale(locale(""), new std::codecvt_utf8<wchar_t>));
    wcerr.imbue(std::locale(locale(""), new std::codecvt_utf8<wchar_t>));

    wstring userInput;
    getline(wcin, userInput);

    wcerr << "Size of string is " << userInput.length() << endl;

    for (int i = 0; i < userInput.length(); ++i) {
        wcerr << userInput.at(i) << endl;
    }
    return 0;
}

Solution

  • This behavior occurs because, in UTF-8, a character (also known as a code point) is represented by one or more code units.

    Essentially, this loop:

    for (int i = 0; i < userInput.length(); ++i)
    

    iterates over code units. You can verify this by checking that userInput.length() is greater than the number of characters in your string.

    By doing:

    wcerr << userInput.at(i) << endl;
    

    you append an endl after each code unit, separating code units that belong to the same code point, which produces invalid characters.

    If you instead just output:

    wcerr << userInput << endl;
    

    You will get your string intact.

    If you want to output each character separately, you have to account for code points that span multiple code units and output all the code units of one code point together.
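
    As a rough illustration of grouping code units, here is a minimal sketch that works on raw UTF-8 bytes held in a std::string. The helper names is_continuation_byte and split_code_points are made up for this example, and it assumes the input is well-formed UTF-8 (nothing is validated). It relies on the fact that every continuation byte in UTF-8 has the bit pattern 10xxxxxx.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: true for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool is_continuation_byte(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Hypothetical helper: splits UTF-8 text into one std::string per code point.
// Assumes well-formed UTF-8; malformed input is not detected.
std::vector<std::string> split_code_points(const std::string& utf8) {
    std::vector<std::string> result;
    for (unsigned char byte : utf8) {
        if (is_continuation_byte(byte) && !result.empty()) {
            // Continuation byte: it belongs to the code point started earlier.
            result.back().push_back(static_cast<char>(byte));
        } else {
            // Lead byte (or plain ASCII): start a new code point.
            result.push_back(std::string(1, static_cast<char>(byte)));
        }
    }
    return result;
}
```

    Printing each element of the returned vector followed by a newline then gives one character per line on a UTF-8 terminal, because the bytes of a multi-byte character are never split apart.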

    UPDATE:

    wcin doesn't convert its input to code points by default. You need to explicitly state the encoding of the input and convert it. This is essentially what the following code does. The only major difference from your example is that I used the C++11 standard library instead of Boost.

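    #include <codecvt>
    #include <cstddef>
    #include <iostream>
    #include <locale>
    #include <string>
    
    int main() {
    
        // Global locale: the user's default locale with its codecvt facet replaced by a UTF-8 one
        std::locale::global( std::locale( std::locale(""), new std::codecvt_utf8<wchar_t> ) );
    
        // The standard streams were constructed earlier, so imbue them explicitly
        std::wcin.imbue( std::locale() );
        std::wcout.imbue( std::locale() );
        std::wcerr.imbue( std::locale() );
    
        // Note: operator>> reads a single whitespace-delimited word; use std::getline for a whole line
        std::wstring user_input;
        std::wcin >> user_input;
    
        // One decoded character per line
        for( std::size_t i = 0; i < user_input.length(); ++i ) {
            std::wcout << user_input[i] << std::endl;
        }
    
        // Converting characters to uppercase
        const std::ctype<wchar_t>& f = std::use_facet<std::ctype<wchar_t>>( std::locale() );
    
        for( std::size_t i = 0; i < user_input.length(); ++i ) {
            std::wcout << f.toupper(user_input[i]) << std::endl; // f.tolower() for lowercase
        }
    
        return 0;
    }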
    

    P.S. To compile this, you need to pass the C++11 standard flag:

    g++ -std=c++11 main.cpp
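
    To make concrete what "converting to code points" means in the update above, here is a hand-written sketch of the UTF-8 decoding step that a facet such as std::codecvt_utf8 performs on the stream's behalf. The function name decode_utf8 is invented for this example, and the code assumes well-formed input: malformed or truncated sequences are not detected. It may also be handy as a fallback, since std::codecvt_utf8 is deprecated as of C++17.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes well-formed UTF-8 into UTF-32 code points.
// Sketch only: malformed or truncated input is not detected.
std::u32string decode_utf8(const std::string& utf8) {
    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t code_point;
        std::size_t extra;  // number of continuation bytes that follow the lead byte
        if (lead < 0x80)      { code_point = lead;        extra = 0; }  // 0xxxxxxx
        else if (lead < 0xE0) { code_point = lead & 0x1F; extra = 1; }  // 110xxxxx
        else if (lead < 0xF0) { code_point = lead & 0x0F; extra = 2; }  // 1110xxxx
        else                  { code_point = lead & 0x07; extra = 3; }  // 11110xxx
        ++i;
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 0; k < extra && i < utf8.size(); ++k, ++i) {
            code_point = (code_point << 6)
                       | (static_cast<unsigned char>(utf8[i]) & 0x3F);
        }
        out.push_back(code_point);
    }
    return out;
}
```

    With this, the length of the returned std::u32string is the number of code points, so an index-based loop like the ones above visits one character per iteration rather than one byte.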