Here is a snippet of a code that is using std::codecvt_utf8<>
facet to convert from wchar_t
to UTF-8. With Visual Studio 2012, my expectations are not met (see the condition at the end of the code). Are my expectations wrong? Why? Or is this a Visual Studio 2012 library issue?
#include <locale>
#include <codecvt>
#include <cstdlib>
int main ()
{
std::mbstate_t state = std::mbstate_t ();
std::locale loc (std::locale (), new std::codecvt_utf8<wchar_t>);
typedef std::codecvt<wchar_t, char, std::mbstate_t> codecvt_type;
codecvt_type const & cvt = std::use_facet<codecvt_type> (loc);
wchar_t ch = L'\u5FC3';
wchar_t const * from_first = &ch;
wchar_t const * from_mid = &ch;
wchar_t const * from_end = from_first + 1;
char out_buf[1];
char * out_first = out_buf;
char * out_mid = out_buf;
char * out_end = out_buf + 1;
std::codecvt_base::result cvt_res
= cvt.out (state, from_first, from_end, from_mid,
out_first, out_end, out_mid);
// This is what I expect:
if (cvt_res == std::codecvt_base::partial
&& out_mid == out_end
&& state != 0)
;
else
abort ();
}
The expectation here is that the out()
function output one byte of the UTF-8 conversion at a time but the middle of the if
conditional above is false with Visual Studio 2012.
What fails is the out_mid == out_end
and state != 0
conditions. Basically, I expect at least one byte to be produced and the necessary state, for next byte of the UTF-8 sequence to be producible, to be stored in the state
variable.
The standard description of partial
return code of codecvt::do_out
says exactly this:
in Table 83:
partial
not all source characters converted
In 22.4.1.4.2[locale.codecvt.virtuals]/5:
Returns: An enumeration value, as summarized in Table 83. A return value of
partial
, if(from_next==from_end)
, indicates that either the destination sequence has not absorbed all the available destination elements, or that additional source elements are needed before another destination element can be produced.
In your case, not all (zero) source characters were converted, which technically says nothing of the contents of the output sequence (the 'if' clause in the sentence is not entered), but speaking generally, "the destination sequence has not absorbed all the available destination elements" here talks about valid multibyte characters. They are the elements of the multibyte character sequence produced by codecvt_utf8
.
It would be nice to have a more explicit standard wording, but here are two circumstantial pieces of evidence:
One: the old C's wide-to-multibyte conversion function std::wcsrtombs
(whose locale-specific variants are usually called by the existing implementations of codecvt::do_out
for system-supplied locales) is defined as follows:
Conversion stops [...] when the next multibyte character would exceed the limit of len total bytes to be stored into the array pointed to by dst.
And two, look at the existing implementations of codecvt_utf8
: you've already explored Microsoft's, and here's what's in libc++: codecvt_utf8::do_out
here calls ucs2_to_utf8
on Windows and ucs4_to_utf8
on other systems, and ucs2_to_utf8 does the following (comments mine):
else if (wc < 0x0800)
{
// not relevant
}
else // if (wc <= 0xFFFF)
{
if (to_end-to_nxt < 3)
return codecvt_base::partial; // <- look here
*to_nxt++ = static_cast<uint8_t>(0xE0 | (wc >> 12));
*to_nxt++ = static_cast<uint8_t>(0x80 | ((wc & 0x0FC0) >> 6));
*to_nxt++ = static_cast<uint8_t>(0x80 | (wc & 0x003F));
}
nothing is written to the output sequence if it cannot fit a multibyte character that results from consuming one input wide character.