c++ windows input unicode utf-8

Inconsistent format of UTF-8 characters in C++


I've been trying to develop a small console app in C++ for Windows that interacts with an SQLite database. However, the database may contain UTF-8 text, e.g. Greek letters, so the program needs to read these characters from the console, use them in queries, and output them.

I'd like to input these characters using getline or ideally getch.

At first, not even simply inputting and outputting a UTF-8 string worked.

Using

    SetConsoleOutputCP(CP_UTF8);
    SetConsoleCP(CP_UTF8);

caused the input strings to have the correct length, but all the characters were null. E.g. inputting "ΑΒΓ" would store a string of 3 characters of value 0.

Using 1253 (the code page for Greek) instead of CP_UTF8 worked for concatenating, inputting and outputting. I noticed in the debugger that the values of the strings were invalid and wouldn't show up correctly. I also noticed that each character took only 1 byte instead of 2, but since it would output fine I didn't think much of it, as no data could have been lost.

However, the SQLite API didn't agree. Constructing the query from the user input would give no result. Executing a query using a hardcoded string literal with the same UTF-8 characters would give a result, but both the hardcoded query and the results had a different format than the user input I got before. They would render correctly in the debugger, had 2 bytes per character, and would output gibberish under code page 1253 but correctly under CP_UTF8.

I have not found any reference to this discrepancy online, though I'm not sure if I even looked in the right places. Since the SQLite results output correctly under CP_UTF8, I'd like at the very least to be able to input characters in the format the SQLite API expects (i.e. the same format in which literals are stored), or at least be able to convert from the current 1253 format to it.

Below is a minimal reproducible example with the concepts I mentioned above:

    #include <iostream>
    #include <string>
    #include <windows.h>

    using namespace std;

    int main()
    {
        SetConsoleOutputCP(CP_UTF8);
        SetConsoleCP(CP_UTF8);

        // Dysfunctional input
        string input1;
        cout << "Enter greek letters: ";
        cin >> input1;
        cout << "You entered: " << input1 << endl; // input1 has the correct size but all chars are null

        SetConsoleOutputCP(1253);
        SetConsoleCP(1253);

        // Working input and output
        string input2;
        cout << "Enter greek letters: ";
        cin >> input2;
        cout << "You entered: " << input2 << endl; // Outputs properly

        // Printing a literal under 1253
        string literall = u8"Γειά σας";
        cout << "Literal under 1253: " << literall << endl; // Gibberish

        // Printing a literal under CP_UTF8
        SetConsoleOutputCP(CP_UTF8);
        cout << "Literal under CP_UTF8: " << literall << endl; // Correct output

        return 0;
    }

The above code shows similar results with getline and _getch instead of cin which is promising.

Final Notes:


Solution

  • UTF-8 and the Windows Console

    I have never been able to make the Windows Console work directly with UTF-8 without kludging something, either because of Windows itself or the compiler/version/libc/etc.

    However, using the Console API with wide-character functions always works. So to make UTF-8 I/O work, you need to give the standard streams stream buffers that convert between UTF-8 and UTF-16 on the way in and out. The following sets things up:

    #include <iostream>
    #include <streambuf>
    #include <string>

    #include <windows.h>
    #include <shellapi.h>
    
    #pragma comment(lib, "Shell32")
    
    ///////////////////////////////////////////////////////////////////////////////////////////////////
    namespace duthomhas::utf8::console
    ///////////////////////////////////////////////////////////////////////////////////////////////////
    {
    
    //-------------------------------------------------------------------------------------------------
    struct Input: public std::streambuf
    //-------------------------------------------------------------------------------------------------
    {
      using int_type = std::streambuf::int_type;
      using traits   = std::streambuf::traits_type;
    
      HANDLE  handle;
      char    buffer[ 4 ];
      wchar_t c;
      DWORD   n;
    
      Input( HANDLE handle ): handle(handle) { }
      Input( const Input& that ): handle(that.handle) { }
    
      virtual int_type underflow() override
      {
        auto ok = ReadConsoleW( handle, &c, 1, &n, NULL );
        if (!ok or !n) return traits::eof();
        if (c == '\r') return underflow();
    
        n = WideCharToMultiByte( CP_UTF8, 0, (const wchar_t*)&c, 1, (char*)buffer, sizeof( buffer ), NULL, NULL );
        setg( buffer, buffer, buffer + n );
    
        return n ? traits::to_int_type( *buffer ) : traits::eof();
      }
    };
    
    //-------------------------------------------------------------------------------------------------
    struct Output: public std::streambuf
    //-------------------------------------------------------------------------------------------------
    {
      using int_type = std::streambuf::int_type;
      using traits   = std::streambuf::traits_type;
    
      HANDLE      handle;
      std::string buffer;
    
      Output( HANDLE handle ): handle(handle) { }
      Output( const Output& that ): handle(that.handle) { }
    
      virtual int sync() override
      {
        if (buffer.empty()) return 0;  // nothing buffered, nothing to write

        // Convert the buffered UTF-8 to UTF-16. One wchar_t per input byte is always enough room.
        std::wstring s( buffer.size(), 0 );
        s.resize( MultiByteToWideChar( CP_UTF8, 0, buffer.data(), (int)buffer.size(), &s[0], (int)s.size() ) );
        if (s.empty()) return -1;      // conversion failed
        buffer.clear();

        DWORD n;
        return WriteConsoleW( handle, s.data(), (DWORD)s.size(), &n, NULL ) ? 0 : -1;
      }
    
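      virtual int_type overflow( int_type value ) override
      {
        // overflow( eof ) is a request to flush, not a character to store
        if (traits::eq_int_type( value, traits::eof() ))
          return sync() == 0 ? traits::not_eof( value ) : traits::eof();

        buffer.push_back( traits::to_char_type( value ) );
        if (traits::to_char_type( value ) == '\n') sync();
        return value;
      }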
    };
    
    //-------------------------------------------------------------------------------------------------
    void initialize()
    //-------------------------------------------------------------------------------------------------
    {
      // Update the standard I/O streams, maybe
      DWORD mode; HANDLE
      handle = GetStdHandle( STD_INPUT_HANDLE  ); if (GetConsoleMode( handle, &mode )) std::cin .rdbuf( new Input ( handle ) );
      handle = GetStdHandle( STD_OUTPUT_HANDLE ); if (GetConsoleMode( handle, &mode )) std::cout.rdbuf( new Output( handle ) );
      handle = GetStdHandle( STD_ERROR_HANDLE  ); if (GetConsoleMode( handle, &mode )) std::cerr.rdbuf( new Output( handle ) );
    }
    
    } // namespace duthomhas::utf8::console
    

    Now in your main(), make sure to initialize:

    int main()
    {
      duthomhas::utf8::console::initialize();
    
      // Ask the user to "Enter some Greek"
      // (the source file must be compiled as UTF-8, e.g. /utf-8 with MSVC, for this literal to hold UTF-8 bytes)
      std::cout << "Βάλε λίγα ελληνικά: ";
      std::string s;
      getline( std::cin, s );
      std::cout << "Good job! You entered: " << s << "!\n";
    }
    

    Again, this always works — because it bypasses the usual char-is-a-byte handling and uses Windows’ UTF-16 handling directly under the hood — but only if you are actually attached to the console!
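
The Output class above is essentially a line-buffering stream buffer with a conversion step at flush time. To try that pattern without a console, here is a minimal portable sketch. `LineSink` and `chunks_written` are hypothetical names used only for illustration, and the sketch hands each flushed chunk to a callback where the real class calls WriteConsoleW.

```cpp
#include <functional>
#include <ostream>
#include <streambuf>
#include <string>
#include <utility>
#include <vector>

// Hypothetical illustration: collects bytes and hands them to `flush_fn`
// on every newline or explicit flush, like Output::overflow/sync above.
class LineSink : public std::streambuf
{
public:
  explicit LineSink( std::function<void( const std::string& )> flush_fn )
    : flush_fn_( std::move( flush_fn ) ) { }

protected:
  int sync() override
  {
    if (!pending_.empty()) { flush_fn_( pending_ ); pending_.clear(); }
    return 0;
  }

  int_type overflow( int_type value ) override
  {
    // overflow( eof ) only asks for a flush
    if (traits_type::eq_int_type( value, traits_type::eof() ))
      { sync(); return traits_type::not_eof( value ); }

    pending_.push_back( traits_type::to_char_type( value ) );
    if (traits_type::to_char_type( value ) == '\n') sync();
    return value;
  }

private:
  std::function<void( const std::string& )> flush_fn_;
  std::string pending_;
};

// Demonstration helper: route `text` through a LineSink and record each flushed chunk
std::vector<std::string> chunks_written( const std::string& text )
{
  std::vector<std::string> chunks;
  LineSink sink( [&chunks]( const std::string& s ) { chunks.push_back( s ); } );
  std::ostream out( &sink );
  out << text << std::flush;
  return chunks;
}
```

Because no put area is set, every byte goes through overflow(), and bytes only leave the buffer at a newline or an explicit flush. That keeps a multi-byte UTF-8 character in one chunk unless the program flushes in the middle of it.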

    ⟶ Do remember, though, that the Windows console cannot handle anything outside the BMP. Redirected file I/O still works with the full Unicode set.
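
One way to see where the BMP boundary bites: the Input class above reads and converts a single wchar_t per call, while a code point above U+FFFF occupies two UTF-16 code units (a surrogate pair). The sketch below shows the split. `to_utf16` is a hypothetical helper, not part of any Windows API, and it assumes it is given a valid code point.

```cpp
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Code points up to U+FFFF (the BMP) take one unit; the rest take a surrogate pair.
// No validation: assumes cp <= 0x10FFFF and that cp is not itself a surrogate.
std::u16string to_utf16( char32_t cp )
{
  std::u16string out;
  if (cp < 0x10000)
  {
    out.push_back( static_cast<char16_t>( cp ) );
  }
  else
  {
    cp -= 0x10000;
    out.push_back( static_cast<char16_t>( 0xD800 + (cp >> 10) ) );    // high surrogate
    out.push_back( static_cast<char16_t>( 0xDC00 + (cp & 0x3FF) ) );  // low surrogate
  }
  return out;
}
```

Greek capital alpha (U+0391) comes out as one unit, whereas U+1F600 comes out as the two units 0xD83D 0xDE00, neither of which is a character on its own.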

    Unicode code point ≠ one character ≠ one byte

    A Unicode code point needs up to 21 bits, and is typically stored in a 32-bit integer object (such as a char32_t).

    If you wish to handle UTF-8 you can no longer treat “characters” as byte values. A single character can properly be one to four bytes long, and every successive character may be a different number of bytes.
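
To make the variable width concrete, here is a small decoder sketch that turns UTF-8 bytes into char32_t code points. Both function names are made up for this example, and it is deliberately incomplete: it stops at the first malformed or truncated sequence and does not reject overlong encodings, so treat it as an illustration rather than a validator.

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` is a continuation byte or not a valid lead byte.
std::size_t utf8_sequence_length( unsigned char lead )
{
  if (lead < 0x80)           return 1;  // 0xxxxxxx
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;
}

// Decode a UTF-8 string into code points (sketch: stops at the first bad sequence).
std::u32string decode_utf8( const std::string& s )
{
  std::u32string out;
  for (std::size_t i = 0; i < s.size(); )
  {
    unsigned char lead = static_cast<unsigned char>( s[i] );
    std::size_t   len  = utf8_sequence_length( lead );
    if (len == 0 || i + len > s.size()) break;

    // Payload bits of the lead byte, then 6 bits from each continuation byte
    char32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
    for (std::size_t k = 1; k < len; k++)
    {
      unsigned char c = static_cast<unsigned char>( s[i + k] );
      if ((c & 0xC0) != 0x80) return out;
      cp = (cp << 6) | (c & 0x3F);
    }
    out.push_back( cp );
    i += len;
  }
  return out;
}
```

With this, the six bytes of "ΑΒΓ" decode to the three code points U+0391, U+0392 and U+0393, which is why the UTF-8 strings in the question showed 2 bytes per Greek letter.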

    That, and a single “character glyph” may be composed of more than one code point!
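
A short sketch of that last point, using the Greek letter "ά": it can be stored as the single code point U+03AC or as U+03B1 followed by the combining accent U+0301. `count_code_points` is a hypothetical helper that assumes well-formed UTF-8, and the bytes are written as escapes so the example does not depend on the source file's encoding.

```cpp
#include <cstddef>
#include <string>

// Count code points by counting every byte that is not a continuation byte (10xxxxxx).
// Assumes `s` is well-formed UTF-8.
std::size_t count_code_points( const std::string& s )
{
  std::size_t n = 0;
  for (unsigned char c : s)
    if ((c & 0xC0) != 0x80) n++;
  return n;
}

// Two spellings of the same visible character:
const std::string precomposed = "\xCE\xAC";          // U+03AC GREEK SMALL LETTER ALPHA WITH TONOS
const std::string decomposed  = "\xCE\xB1\xCC\x81";  // U+03B1 ALPHA + U+0301 COMBINING ACUTE ACCENT
```

The first string is 2 bytes and 1 code point, the second is 4 bytes and 2 code points, and a plain `==` says they differ even though they should render the same. Comparing them as equal needs Unicode normalization, which is the kind of job ICU is for.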

    tl;dr: everything is a string.

    Almost everything you will want to do with UTF-8 can be handled as substrings, and you should structure your code as such.
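
As a sketch of what "handle it as substrings" can look like, here is a plain byte-wise search and replace on std::string. `replace_all` is a hypothetical helper, not a standard function. It is safe on well-formed UTF-8 because lead bytes and continuation bytes never overlap, so a well-formed needle can only match at a character boundary.

```cpp
#include <cstddef>
#include <string>

// Replace every occurrence of `from` with `to`, working purely on bytes.
// Assumes `text`, `from` and `to` are all well-formed UTF-8.
std::string replace_all( std::string text, const std::string& from, const std::string& to )
{
  if (from.empty()) return text;
  for (std::size_t pos = 0; (pos = text.find( from, pos )) != std::string::npos; pos += to.size())
    text.replace( pos, from.size(), to );
  return text;
}
```

For example, `replace_all( s, "\xCE\xB1", "a" )` swaps every "α" in `s` for an ASCII "a" without ever needing to know how many bytes any other character in `s` takes.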

    If you plan to do anything with the UTF-8 data, you should take a look at ICU.
    Here is another answer I wrote specifically about using ICU.

    ⟶ ICU also includes functions for interacting with the console, but they are not out-of-the-box-supported on Windows — again due to compiler/version/etc.