There are many questions on this site regarding Unicode and wchar_t. I guess I have grasped the concept, but then found something that, if true, proves most (if not all) answers wrong. On this page, Microsoft claims that one wchar_t character can hold any Unicode character (emphasis mine):
A wide character is a 2-byte multilingual character code. Any character in use in modern computing worldwide, including technical symbols and special publishing characters, can be represented according to the Unicode specification as a wide character. Developed and maintained by a large consortium that includes Microsoft, the Unicode standard is now widely accepted.
A wide character is of type wchar_t. A wide-character string is represented as a wchar_t[] array. You point to the array with a wchar_t* pointer.
Since this statement is from Microsoft directly, I am quite worried now:
How can a "two-byte multilingual character code" hold any character of the Unicode character set, which already contains around 150,000 assigned code points (characters)? [Plus, if we take into account the private-use code points, surrogates, code points that have already been reserved, and so on, wouldn't it be over 1,000,000 code points?]
I hope that this question is not a duplicate because its core is that Microsoft itself states something which seems to be plain wrong, and I really would like to know what I have misunderstood specifically on the Microsoft page I have linked.
By the way, there is also this page, which contradicts the first one and, as it turns out, tells the truth:
Windows represents Unicode characters using UTF-16 encoding, in which each character is encoded as one or two 16-bit values.
So obviously, we sometimes need two wchar_t characters (4 bytes) to represent a Unicode code point. Well, that would make sense somehow, but given the contradicting documentation, I am completely unsure now.
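The second page's claim follows directly from the UTF-16 rules: code points up to U+FFFF fit in one 16-bit unit, while everything above is split into a surrogate pair. A minimal sketch in C (the function name `utf16_encode` is my own, not anything from the Windows API):

```c
#include <stdint.h>
#include <stddef.h>

/* Encode one Unicode code point (scalar value <= 0x10FFFF) as UTF-16.
   Writes one or two 16-bit units into out and returns the unit count.
   Illustrative sketch; utf16_encode is my own name, not a Windows API. */
size_t utf16_encode(uint32_t cp, uint16_t out[2]) {
    if (cp < 0x10000) {                          /* BMP character: one unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate  */
    return 2;
}
```

For example, U+20AC (the euro sign) fits in one unit, while U+1F600 (an emoji) becomes the pair 0xD83D 0xDE00 and needs two wchar_t on Windows.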
If somebody is interested in how that question originated:
In one of my projects, I have a string that must have a character at a certain fixed position replaced by another character. This happens in a loop and must be done as fast as possible. This is a no-brainer with normal char[] strings. But the string in question is of type wchar_t[], and I don't have control over the replacement characters.
Depending on which of the above Microsoft statements is true, this is either a no-brainer, too (if the first statement is true), or quite a mess (if the second statement is true): I could not just replace the wchar_t at the respective index with the replacement character, because the original character might occupy one wchar_t while the replacement might need two, or vice versa.
That's why I'd like to know which documentation is true.
How can a "two-byte multilingual character code" hold any character of the Unicode character set that already contains around 150,000 code points (characters)?
A single 16-bit wchar_t can represent only a small fraction of all the code point values in Unicode's 21-bit code space. There was a point early in Unicode's history when it was thought that 65536 characters (the number of distinct 16-bit values) would be enough, and that's how Java, for example, which was trying to be forward-looking, ended up with a 16-bit char type. It didn't take long to realize, however, that 65536 was completely insufficient. As you observe, far more code points than that are assigned today.
The Microsoft statement you quoted is wrong.
However, there is a well-established mechanism (UTF-16) for encoding any Unicode code point value in at most two 16-bit code units. This is what the Windows API in fact expects and uses for wide strings. It is possible that the bad statement is simply out of date, or that it resulted from a bad edit. This correct variation is very similar:
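For completeness, the reverse direction shows how two 16-bit units combine back into one code point: units in 0xD800–0xDBFF (high surrogates) pair with units in 0xDC00–0xDFFF (low surrogates) to carry the 20 bits above the BMP. A hedged sketch (function names are mine):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Classify UTF-16 code units (names are mine, not from any API). */
bool is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Decode the code point starting at s[i]; sets *units to 1 or 2.
   Assumes well-formed UTF-16, i.e. a high surrogate is always
   followed by a low surrogate. */
uint32_t utf16_decode_at(const uint16_t *s, size_t i, size_t *units) {
    if (is_high_surrogate(s[i]) && is_low_surrogate(s[i + 1])) {
        *units = 2;
        return 0x10000 + (((uint32_t)(s[i] - 0xD800) << 10)
                          | (uint32_t)(s[i + 1] - 0xDC00));
    }
    *units = 1;
    return s[i];
}
```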
Any character in use in modern computing worldwide, including technical symbols and special publishing characters, can be represented according to the Unicode specification via wide characters.
If one didn't read critically and/or was not well informed about the details of Unicode, one might not recognize that this does not mean the same as the original.
Side note:
In one of my projects, I have a string that must have a character at a certain fixed position replaced by another character. This happens in a loop and must be done as fast as possible. This is a no-brainer with normal char[] strings.
No, it isn't. Not at all. Depending on the encoding with which you were representing your strings and the range of characters to be supported, you might need to deal with characters that are not supported by your encoding, or with the fact that ordinary strings can still use variable-length encodings. In particular, UTF-8 is pretty common. Modern C even has a syntax for UTF-8 string literals.
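To illustrate with UTF-8: a single accented letter already occupies two bytes, so the byte count and the character count of a char[] string diverge, and "replace the character at index i" stops being a byte-sized operation. A small sketch (the helper name is my own):

```c
#include <stddef.h>

/* Count the code points in a UTF-8 string by skipping continuation
   bytes (those of the form 10xxxxxx). Sketch only; assumes valid
   UTF-8, and the function name is my own. */
size_t utf8_codepoint_count(const char *s) {
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)  /* not a continuation byte */
            n++;
    return n;
}
```

For "\xC3\xA9" (U+00E9, 'é', in UTF-8), strlen reports 2 bytes but this counts 1 character.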
But the string in question is of type wchar_t[], and I don't have control over the replacement characters.
Yes, unless the range of characters you must support is constrained in a suitable way, you need to deal with the possibility that one of the two characters involved in the replacement requires 2 units, but the other uses only one.
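A sketch of what that replacement then entails (all names are my own; this assumes well-formed UTF-16 and a buffer with room for one extra unit): when the old and new characters use different numbers of units, the tail of the string has to be shifted.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>
#include <stdbool.h>

static bool is_high(uint16_t u) { return (u & 0xFC00) == 0xD800; }

/* Replace the code point starting at unit index i in a 0-terminated
   UTF-16 buffer with code point cp, shifting the tail if the old and
   new characters occupy different numbers of 16-bit units. Sketch
   only; assumes the buffer has spare capacity for one extra unit. */
void utf16_replace_at(uint16_t *s, size_t i, uint32_t cp) {
    size_t old_n = is_high(s[i]) ? 2 : 1;   /* units of the old character */
    uint16_t enc[2];
    size_t new_n;
    if (cp < 0x10000) {
        enc[0] = (uint16_t)cp;
        new_n = 1;
    } else {
        cp -= 0x10000;
        enc[0] = (uint16_t)(0xD800 | (cp >> 10));
        enc[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        new_n = 2;
    }
    if (old_n != new_n) {                   /* shift the rest of the string */
        size_t tail = 0;
        while (s[i + old_n + tail]) tail++;
        memmove(s + i + new_n, s + i + old_n,
                (tail + 1) * sizeof *s);    /* +1 keeps the terminator */
    }
    memcpy(s + i, enc, new_n * sizeof *s);
}
```

The memmove is exactly the cost the questioner was hoping to avoid; it only disappears when both characters are guaranteed to be in the BMP (one unit each).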