While trying to build a char8_t string from escape sequences
(so that it does not depend on the file or compiler encoding), I ran into an issue with MSVC.
I wonder whether this is a bug or implementation-dependent behavior.
Is there a workaround?
#include <algorithm>
#include <iterator>

constexpr char8_t s1[] = u8"\xe3\x82\xb3 \xe3\x83\xb3 \xe3\x83\x8b \xe3\x83\x81 \xe3\x83\x8f";
constexpr unsigned char s2[] = "\xe3\x82\xb3 \xe3\x83\xb3 \xe3\x83\x8b \xe3\x83\x81 \xe3\x83\x8f";
//constexpr char8_t s3[] = u8"コ ン ニ チ ハ";
static_assert(std::equal(std::begin(s1), std::end(s1),
                         std::begin(s2), std::end(s2))); // Fails on MSVC
Note:
The final goal is to replace std::filesystem::u8path(s2)
(std::filesystem::u8path is deprecated since C++20) with std::filesystem::path(s1).
This is a bug in MSVC that I expect to be fixed at some point during Microsoft's implementation of C++23.
Historically, numeric escape sequences in character and string literals were not well specified in the C++ standard, and this led to a number of core issues. These issues were addressed by P2029, a paper adopted for C++23 in November 2020. The reported MSVC bug (along with an additional one related to non-encodable characters) is discussed in the "Implementation impact" section of the paper.
As mentioned by other commenters, using universal-character-names (UCNs) such as \u1234
is the preferred way to avoid a dependency on the source file encoding.