There are several functions that convert ANSI to Unicode and vice versa: WideCharToMultiByte, MultiByteToWideChar, A2W, and W2A.
Now, I don't understand how A2W and W2A work. The thing is, when you convert from one representation to another, you should have two sets, set A and set B, such that each element of set A is mapped to exactly one element of set B. Regarding this there are several problems:
ANSI characters are one byte and Unicode characters are at least two bytes, which means that not every element of the Unicode set can be mapped to a unique element of the ANSI set.
The sets "ANSI" and "Unicode" are not strictly defined; I mean there are several different encodings for both.
Hence my question: how can we convert between them and be sure that we have not corrupted the data?
As others have mentioned, there is no such character set as "ANSI". Unfortunately, the Windows API refers to CP_ACP, the "ANSI code page", which stands for one of several character sets depending on which non-Unicode locale is selected on your machine.
That said, with regard to your original question: no, you cannot always round-trip between CP_ACP and a Unicode encoding. For example, there is no equivalent for あ in CP_ACP on an English-locale Windows system.
When this happens, WideCharToMultiByte replaces any character that has no equivalent with lpDefaultChar, if set, and sets *lpUsedDefaultChar to true. You can pass a pointer to a BOOL variable as lpUsedDefaultChar and check it after the call to see whether your string contained non-translatable characters.
The other direction, MultiByteToWideChar, never fails as long as the input is valid in your local code page. To try to detect invalid text, pass the MB_ERR_INVALID_CHARS flag and check for an error. That said, just because the text came from some other code page doesn't mean you'll get an error: it's hard to tell whether the text is actually invalid or merely gibberish.
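Putting both checks together, a minimal Windows-only sketch might look like the following (the input strings are just illustrative; whether a given byte sequence is rejected depends on the active ANSI code page):

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    // Unicode -> "ANSI": あ (U+3042) has no equivalent in an English-locale
    // ANSI code page, so the conversion is lossy.
    const wchar_t wide[] = L"\u3042";
    char narrow[8] = {};
    BOOL usedDefault = FALSE;
    WideCharToMultiByte(CP_ACP, 0, wide, -1, narrow, sizeof(narrow),
                        nullptr,        // use the system default char (usually '?')
                        &usedDefault);  // set to TRUE if any char was replaced
    if (usedDefault)
        printf("lossy: some characters had no ANSI equivalent\n");

    // "ANSI" -> Unicode: request a hard error on byte sequences that are
    // invalid in the current code page, instead of a silent best guess.
    const char bytes[] = "\x90\x81";    // may or may not be valid in CP_ACP
    wchar_t out[8] = {};
    int n = MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, bytes, -1,
                                out, 8);
    if (n == 0 && GetLastError() == ERROR_NO_UNICODE_TRANSLATION)
        printf("input contained bytes invalid in CP_ACP\n");
    return 0;
}
```

Note that lpDefaultChar/lpUsedDefaultChar must be NULL for some code pages (such as CP_UTF8), but they are usable with CP_ACP as shown here.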