c++11 encoding endianness codecvt

Wrong endian with wstring_convert


I recently discovered the <codecvt> header, so I wanted to convert between UTF-8 and UTF-16.

I use the codecvt_utf8_utf16 facet with wstring_convert from C++11. The problem is that when I convert a UTF-16 string to UTF-8 and then back to UTF-16, the endianness changes.

For this code:

#include <codecvt>  
#include <string>  
#include <locale>  
#include <iostream>  

using namespace std;  

int main(int argc, char const *argv[])
{
  wstring_convert<codecvt_utf8_utf16<char16_t>, char16_t>
                                                convert;

  u16string utf16 = u"\ub098\ub294\ud0dc\uc624";

  cout << hex << "UTF-16\n\n";
  for (char16_t c : utf16)
    cout << "[" << c << "] ";

  string utf8 = convert.to_bytes(utf16);

  cout << "\n\nUTF-16 to UTF-8\n\n";
  for (unsigned char c : utf8)
    cout << "[" << int(c) << "] ";
  cout << "\n\nConverting back to UTF-16\n\n";

  utf16 = convert.from_bytes(utf8);

  for (char16_t c : utf16)
    cout << "[" << c << "] ";
  cout << endl;
}

I get this output:

UTF-16

[b098] [b294] [d0dc] [c624]

UTF-16 to UTF-8

[eb] [82] [98] [eb] [8a] [94] [ed] [83] [9c] [ec] [98] [a4]

Converting back to UTF-16

[98b0] [94b2] [dcd0] [24c6]

When I change the third template argument of codecvt_utf8_utf16 (the conversion mode) to std::little_endian, the bytes are reversed.

What did I miss?
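
The symptom can be reduced to a single round-trip check. The following is only a sketch: round_trips is a hypothetical helper name that does not appear in the code above, and it assumes a standard library that still ships <codecvt> (the header is deprecated since C++17, so newer compilers may emit warnings). With a conforming library, the UTF-16 string should come back unchanged.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper (not part of the original program): returns true when
// a UTF-16 string survives UTF-16 -> UTF-8 -> UTF-16 unchanged.
bool round_trips(const std::u16string& original)
{
  std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;

  // Encode the code units as UTF-8 bytes, then decode them again.
  const std::string utf8 = convert.to_bytes(original);
  return convert.from_bytes(utf8) == original;
}
```

Calling round_trips(u"\ub098\ub294\ud0dc\uc624") is expected to return true; a false result would match the byte-swapped output shown above.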


Solution

  • It was indeed a bug in GCC: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66855. It will be fixed in GCC 5.3.
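
For a compiler that does not yet have the fix, one possible workaround (not part of the bug report or the answer above) is to decode the UTF-8 bytes by hand, so that the result does not depend on the facet at all. The sketch below uses a hypothetical helper named utf8_to_utf16 and assumes the input is well-formed UTF-8; it does not detect malformed or overlong sequences.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes well-formed UTF-8 into UTF-16 code units
// stored in the machine's native byte order. Malformed input is NOT detected.
std::u16string utf8_to_utf16(const std::string& utf8)
{
  std::u16string out;
  std::size_t i = 0;

  while (i < utf8.size()) {
    const unsigned char lead = static_cast<unsigned char>(utf8[i]);
    char32_t cp;        // code point being assembled
    std::size_t extra;  // number of continuation bytes that follow

    // The lead byte gives the sequence length and the top payload bits.
    if (lead < 0x80)      { cp = lead;        extra = 0; }
    else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
    else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
    else                  { cp = lead & 0x07; extra = 3; }
    ++i;

    // Each continuation byte contributes its low six bits.
    for (; extra > 0 && i < utf8.size(); --extra, ++i)
      cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);

    if (cp < 0x10000) {
      // Basic Multilingual Plane: a single code unit.
      out.push_back(static_cast<char16_t>(cp));
    } else {
      // Supplementary plane: split into a surrogate pair.
      cp -= 0x10000;
      out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
  }
  return out;
}
```

Fed the twelve UTF-8 bytes from the question's output, this yields the code units b098, b294, d0dc and c624, which is the original string.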