c++unicodemultibyte

How do I use CharNext in the Windows API properly?


I have a multi-byte string containing a mixture of japanese and latin characters. I'm trying to copy parts of this string to a separate memory location. Since it's a multi-byte string, some of the characters uses one byte and other characters uses two. When copying parts of the string, I must not copy "half" japanese characters. To be able to do this properly, I need to be able to determine where in the multi-byte string characters starts and ends.

As an example, if the string contains 3 characters which requires [2 byte][2 byte][1 byte], I must copy either 2, 4 or 5 bytes to the other location and not 3, since if I were copying 3 I would copy only half the second character.

To figure out where in the multi-byte string characters starts and ends, I'm trying to use the Windows API function CharNext and CharNextExA but without luck. When I use these functions, they navigate through my string one byte at a time, rather than one character at a time. According to MSDN, CharNext is supposed to The CharNext function retrieves a pointer to the next character in a string..

Here's some code to illustrate this problem:

#include <windows.h>
#include <stdio.h>
#include <wchar.h>
#include <string.h>

/* string consisting of six "asian" characters */
wchar_t wcsString[] = L"\u9580\u961c\u9640\u963f\u963b\u9644";

int main() 
{
   // Convert the asian string from wide char to multi-byte.
   LPSTR mbString = new char[1000];
   WideCharToMultiByte( CP_UTF8, 0, wcsString, -1, mbString, 100,  NULL, NULL);

   // Count the number of characters in the string.
   int characterCount = 0;
   LPSTR currentCharacter = mbString;
   while (*currentCharacter)
   {
      characterCount++;

     currentCharacter = CharNextExA(CP_UTF8, currentCharacter, 0);
   }
}

(please ignore memory leak and failure to do error checking.)

Now, in the example above I would expect that characterCount becomes 6, since that's the number of characters in the asian string. But instead, characterCount becomes 18 because mbString contains 18 characters:

門阜陀阿阻附

I don't understand how it's supposed to work. How is CharNext supposed to know whether "é–€é" in the string is an encoded version of a Japanese character, or in fact the characters é – € and é?

Some notes:

EDIT: Apparantly the CharNext functions doesn't support UTF-8 but Microsoft forgot to document that. I threw/copiedpasted together my own routine, which I won't use and which needs improving. I'm guessing it's easily crashable.

  LPSTR CharMoveNext(LPSTR szString)
  {
     if (szString == 0 || *szString == 0)
        return 0;

     if ( (szString[0] & 0x80) == 0x00)
        return szString + 1;
     else if ( (szString[0] & 0xE0) == 0xC0)
        return szString + 2;
     else if ( (szString[0] & 0xF0) == 0xE0)
        return szString + 3;
     else if ( (szString[0] & 0xF8) == 0xF0)
        return szString + 4;
     else
        return szString +1;
  }

Solution

  • Here is a really good explanation of what is going on here at the Sorting it All Out blog: Is CharNextExA broken?. In short, CharNext is not designed to work with UTF8 strings.