I recently discovered the <codecvt> header and wanted to use it to convert between UTF-8 and UTF-16.
I use the codecvt_utf8_utf16 facet with wstring_convert from C++11.
The issue is that when I convert a UTF-16 string to UTF-8 and then back to UTF-16, the endianness changes.
For this code:
#include <codecvt>
#include <string>
#include <locale>
#include <iostream>

using namespace std;

int main(int argc, char const *argv[])
{
    wstring_convert<codecvt_utf8_utf16<char16_t>, char16_t> convert;
    u16string utf16 = u"\ub098\ub294\ud0dc\uc624";

    cout << hex << "UTF-16\n\n";
    for (char16_t c : utf16)
        cout << "[" << c << "] ";

    string utf8 = convert.to_bytes(utf16);
    cout << "\n\nUTF-16 to UTF-8\n\n";
    for (unsigned char c : utf8)
        cout << "[" << int(c) << "] ";

    cout << "\n\nConverting back to UTF-16\n\n";
    utf16 = convert.from_bytes(utf8);
    for (char16_t c : utf16)
        cout << "[" << c << "] ";

    cout << endl;
}
I get this output:
UTF-16
[b098] [b294] [d0dc] [c624]
UTF-16 to UTF-8
[eb] [82] [98] [eb] [8a] [94] [ed] [83] [9c] [ec] [98] [a4]
Converting back to UTF-16
[98b0] [94b2] [dcd0] [24c6]
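Comparing the first and last lines of the output, each round-tripped code unit is the original with its two bytes swapped (0xb098 comes back as 0x98b0). A minimal sketch that checks this relationship; `swap_bytes` is a hypothetical helper name used only for illustration, not something from the standard library:

```cpp
// Hypothetical helper: swap the high and low bytes of one UTF-16 code unit.
constexpr char16_t swap_bytes(char16_t c)
{
    return static_cast<char16_t>(((c & 0x00ffu) << 8) | ((c & 0xff00u) >> 8));
}
```

Applying it to each value printed under "UTF-16" gives the values printed under "Converting back to UTF-16", so the code points themselves appear intact and only the byte order within each code unit differs.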
When I change the third template argument of codecvt_utf8_utf16 (its Mode parameter) to std::little_endian, the bytes are reversed.
What did I miss?
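For reference, the endianness flag is a template argument of the facet itself: it is the third parameter (Mode), after Maxcode. A sketch of that declaration, assuming the default Maxcode of 0x10ffff; the alias `utf8_utf16_le` and the function `to_utf8_le` are illustrative names, not part of any library:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Illustrative alias: the facet with an explicit little_endian Mode.
using utf8_utf16_le =
    std::codecvt_utf8_utf16<char16_t, 0x10ffff, std::little_endian>;

// Illustrative helper: encode a UTF-16 string as UTF-8 through that facet.
// wstring_convert throws std::range_error if the conversion fails.
std::string to_utf8_le(const std::u16string& s)
{
    std::wstring_convert<utf8_utf16_le, char16_t> convert;
    return convert.to_bytes(s);
}
```

As far as I understand the standard, this flag should not matter for this particular facet, since UTF-8 is a byte sequence with no byte order and the wide side is plain char16_t in memory, so needing it to get a correct round trip points at the library rather than the calling code.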
It was indeed a bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66855
It will be fixed in GCC 5.3.
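To see whether a given toolchain is affected, a small round-trip check along these lines may help. It is only a sketch built from the same calls as the question's code; `round_trips` is a hypothetical name:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Returns true if UTF-16 -> UTF-8 -> UTF-16 reproduces the input unchanged.
bool round_trips(const std::u16string& original)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    const std::string utf8 = convert.to_bytes(original);
    return convert.from_bytes(utf8) == original;
}
```

Calling `round_trips(u"\ub098\ub294\ud0dc\uc624")` should return true on a library with the fix. Given the output shown in the question, it would be expected to return false on an affected version.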