Tags: c++, unicode, ansi

ANSI to Unicode or backward conversion: how is it possible to do?


There are several functions that convert between ANSI and Unicode, among them WideCharToMultiByte, MultiByteToWideChar, and the A2W/W2A macros.

Now I don't understand how A2W and W2A work. When you convert one thing into another, you should have two sets, A and B, such that each element of set A maps to exactly one element of set B. With that in mind, there are several problems:

  1. An ANSI character is one byte, while a Unicode character is at least two bytes, which means that not every Unicode character can be mapped uniquely to an ANSI character.

  2. The ANSI and Unicode sets are not strictly defined: there are different encodings for both.

Hence my question: how can we convert between them and be sure we have not corrupted the data?


Solution

  • As others have mentioned, there is no such character set as 'ANSI'. Unfortunately, the Windows API refers to CP_ACP, the 'ANSI code page', which maps to one of several character sets depending on which non-Unicode locale is selected on your machine.

    That said, with regard to your original question: no, you cannot always round-trip between CP_ACP and a Unicode encoding. There is no equivalent for あ in CP_ACP on an English-locale Windows system, for example.

    When this happens, WideCharToMultiByte will replace any character that has no equivalent with lpDefaultChar (if set) and set *lpUsedDefaultChar to TRUE. You can pass a pointer to a boolean variable as lpUsedDefaultChar and check it after the call to see whether your string contained untranslatable characters. In the other direction, MultiByteToWideChar never fails as long as the input is valid in your local code page. To try to detect invalid input, pass the MB_ERR_INVALID_CHARS flag and check for an error. That said, text that is actually in some other code page won't necessarily produce an error: it is hard to tell whether the text is truly invalid or merely decodes to gibberish.