I have the following piece of code, which a comment claims converts any character greater than 0x7F to UTF-8. I have the following questions on this code:
if ((const unsigned char)c > 0x7F)
{
    Buffer[0] = 0xC0 | ((unsigned char)c >> 6);
    Buffer[1] = 0x80 | ((unsigned char)c & 0x3F);
    return Buffer;
}
For starters, the code doesn't work in general. By coincidence, it works if the encoding in char (or unsigned char) is ISO-8859-1, because ISO-8859-1 has the same code points as the first 256 Unicode code points. But ISO-8859-1 has largely been superseded by ISO-8859-15, so the code probably won't work. (Try it for 0xA4, for example, which is the Euro sign in ISO-8859-15. It will give you a completely different character.)
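To make that concrete, here is a minimal sketch of what the snippet effectively does for 0xA4 (the helper name naiveConvert is just for illustration, not from your code). It treats the byte value as if it were already a Unicode code point, so it produces the UTF-8 for U+00A4 (CURRENCY SIGN) rather than for U+20AC (EURO SIGN):

#include <cstdio>

// Mimics the snippet in the question for one byte > 0x7F:
// the byte value is used directly as a code point.
void naiveConvert(unsigned char c, unsigned char* buffer)
{
    buffer[0] = 0xC0 | (c >> 6);    // for 0xA4 this gives 0xC2
    buffer[1] = 0x80 | (c & 0x3F);  // for 0xA4 this gives 0xA4
}

int main()
{
    unsigned char buffer[2];
    naiveConvert(0xA4, buffer);
    // Prints "C2 A4": the UTF-8 encoding of U+00A4 (CURRENCY SIGN),
    // not "E2 82 AC", the UTF-8 encoding of U+20AC (EURO SIGN),
    // which is what 0xA4 means in ISO-8859-15.
    std::printf("%02X %02X\n", buffer[0], buffer[1]);
}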
There are two correct ways to do this conversion, both of which depend on knowing the encoding of the byte being entered (which means that you may need several versions of the code, one per encoding). The simplest is to have an array of 256 strings, one per character, and index into it; in that case, you don't need the if at all. The other is to translate the byte into a Unicode code point (32-bit UTF-32), and then translate that into UTF-8 (which can require more than two bytes for some characters: the Euro character is U+20AC, which encodes as 0xE2, 0x82, 0xAC).
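Here is a minimal sketch of the second approach, assuming the input is ISO-8859-15 (the helper names toUtf8 and fromLatin9 are just for illustration): first map the byte to a code point, then encode the code point as UTF-8.

#include <string>

// Encode a single Unicode code point (UTF-32) as UTF-8.
// Covers the full range up to U+10FFFF; does not reject surrogates.
std::string toUtf8(char32_t cp)
{
    std::string result;
    if (cp < 0x80) {
        result += static_cast<char>(cp);
    } else if (cp < 0x800) {
        result += static_cast<char>(0xC0 | (cp >> 6));
        result += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        result += static_cast<char>(0xE0 | (cp >> 12));
        result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        result += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        result += static_cast<char>(0xF0 | (cp >> 18));
        result += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        result += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return result;
}

// Map an ISO-8859-15 byte to a Unicode code point. Only the handful of
// positions where ISO-8859-15 differs from ISO-8859-1 need special
// handling; everywhere else the byte value is the code point.
char32_t fromLatin9(unsigned char c)
{
    switch (c) {
    case 0xA4: return 0x20AC;  // EURO SIGN
    case 0xA6: return 0x0160;  // LATIN CAPITAL LETTER S WITH CARON
    case 0xA8: return 0x0161;  // LATIN SMALL LETTER S WITH CARON
    case 0xB4: return 0x017D;  // LATIN CAPITAL LETTER Z WITH CARON
    case 0xB8: return 0x017E;  // LATIN SMALL LETTER Z WITH CARON
    case 0xBC: return 0x0152;  // LATIN CAPITAL LIGATURE OE
    case 0xBD: return 0x0153;  // LATIN SMALL LIGATURE OE
    case 0xBE: return 0x0178;  // LATIN CAPITAL LETTER Y WITH DIAERESIS
    default:   return c;
    }
}

With this, toUtf8(fromLatin9(0xA4)) yields the three bytes 0xE2, 0x82, 0xAC. The first approach (the 256-entry table) amounts to precomputing toUtf8(fromLatin9(c)) for every byte value once, which is why the if becomes unnecessary.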
EDIT:
For a good introduction to UTF-8: http://www.cl.cam.ac.uk/~mgk25/unicode.html. The title says it is for Unix/Linux, but there is very little, if any, system specific information in it (and such information is clearly marked).