c++ c utf-8 character-encoding mbcs

Logic behind converting a character to UTF-8


I have the following piece of code which, according to its comment, converts any character greater than 0x7F to UTF-8. I have two questions about it:

if ((const unsigned char)c > 0x7F)
{
    Buffer[0] = 0xC0 | ((unsigned char)c >> 6);
    Buffer[1] = 0x80 | ((unsigned char)c & 0x3F);
    return Buffer;
}
  1. How does this code work?
  2. Does the current Windows code page I am using have any effect on the character placed in Buffer?

Solution

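  • For starters, the code doesn't work in general. It writes the two-byte UTF-8 pattern 110xxxxx 10xxxxxx, putting the top two bits of the byte into the first output byte and the low six bits into the second, which treats the byte value itself as the Unicode code point. By coincidence, that is correct if the encoding in char (or unsigned char) is ISO-8859-1, because ISO-8859-1 has the same code points as the first 256 Unicode code points. But ISO-8859-1 has largely been superseded by ISO-8859-15, so it probably won't work. (Try it for 0xA4, for example, which is the Euro sign in ISO-8859-15: the code will give you a completely different character.)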

    There are two correct ways to do this conversion, both of which depend on knowing the encoding of the input byte (which means that you may need several versions of the code, one per encoding). The simplest is to have an array of 256 strings, one per byte value, and index into that; in that case, you don't need the if. The other is to translate the byte into a Unicode code point (a 32-bit UTF-32 value), and then translate that into UTF-8, which can require more than two bytes for some characters: the Euro sign is U+20AC, which encodes as 0xE2, 0x82, 0xAC.

    EDIT:

    For a good introduction to UTF-8: http://www.cl.cam.ac.uk/~mgk25/unicode.html. The title says it is for Unix/Linux, but there is very little, if any, system-specific information in it (and such information is clearly marked).
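
As an illustration of the second approach (byte to code point, then code point to UTF-8), here is a minimal sketch that assumes the input bytes are ISO-8859-15. The function names encode_utf8, latin9_to_code_point and latin9_byte_to_utf8 are made up for this example and are not taken from the code in the question; a different input encoding, such as a Windows code page, would need its own mapping function.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
// Returns an empty string for surrogates and values above U+10FFFF.
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 110xxxxx 10xxxxxx (the only case the code in the question handles)
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;  // surrogates are not valid code points to encode
        }
        // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Map one ISO-8859-15 byte to its Unicode code point.
// Only eight positions differ from ISO-8859-1; every other byte
// value is numerically equal to its code point.
std::uint32_t latin9_to_code_point(unsigned char byte)
{
    switch (byte) {
    case 0xA4: return 0x20AC;  // Euro sign
    case 0xA6: return 0x0160;  // S with caron
    case 0xA8: return 0x0161;  // s with caron
    case 0xB4: return 0x017D;  // Z with caron
    case 0xB8: return 0x017E;  // z with caron
    case 0xBC: return 0x0152;  // ligature OE
    case 0xBD: return 0x0153;  // ligature oe
    case 0xBE: return 0x0178;  // Y with diaeresis
    default:   return byte;
    }
}

// Convert a single ISO-8859-15 byte to its UTF-8 byte sequence.
std::string latin9_byte_to_utf8(unsigned char byte)
{
    return encode_utf8(latin9_to_code_point(byte));
}
```

With this sketch, latin9_byte_to_utf8(0xA4) yields the three bytes 0xE2, 0x82, 0xAC, whereas the code in the question would produce 0xC2, 0xA4, which is U+00A4 (the generic currency sign). For bytes that ISO-8859-15 shares with ISO-8859-1, such as 0xE9, both give the same two-byte result.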