java string chararray

Different encoding results when building a String from a byte array versus a char array cast byte-by-byte


I am developing a Java application that receives a value of type char[] from an external C++ DLL. The input is sometimes expected to contain non-ASCII characters. In that case, everything works when I construct a String directly from a byte[] obtained from the hex-string representation of the input value. However, I run into a problem when I construct a String from a char array that I fill in a for loop by casting each byte to char, one by one.

In the example below, the DLL returns a char[] for the input string "çap", which arrives with the hex-string value C3A76170.

// the StringUtil.toByteArray function converts hex-string to a byte array
byte[] byteArray = StringUtil.toByteArray("C3A76170");

The following example yields the expected result:

String s1 = new String(byteArray);

// print
System.out.println(s1);
çap

The following example does not yield the expected result:

char[] chars = new char[byteArray.length];
for (int i = 0; i < chars.length; i++) {
    chars[i] = (char) byteArray[i];
}
String s2 = new String(chars);

// print
System.out.println(s2);
ᅢᄃap

In the second example, the output is "ᅢᄃap" (where the character "ç" is apparently misinterpreted as different characters).

What can cause this discrepancy between outputs? What is the reasoning behind this behavior?


Solution

  • C and C++ use the char type to represent a single byte. However, byte and char are not the same thing in Java: a byte is a signed 8-bit value, while a char is a 16-bit UTF-16 code unit. Unicode defines well over 100,000 characters, so a single byte cannot represent all of them. Some characters have to be represented with multiple bytes.

    The exact method for using multiple bytes to represent a single character is known as a Charset, also known as a character encoding (or sometimes just “encoding”).

    The most popular charset is UTF-8, because it is compact for text in Latin-script languages and because it is compatible with ASCII. Your C++ library returned "çap" as a UTF-8 byte sequence.

    When your code does new String(byteArray), it is using a Charset to translate the bytes to characters. In Java 18 and later, that Charset is UTF-8 by default. (Older versions of Java use the system’s default charset, which is typically UTF-8 on systems other than Windows.) Passing the charset explicitly, as in new String(byteArray, StandardCharsets.UTF_8), removes the dependency on the default.

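    When your code does (char) byteArray[i], it is forcing each byte to act as its own character, ignoring the possibility of multi-byte sequences. ç is represented in UTF-8 as the two bytes 0xc3 0xa7. The two bytes are not separate characters; together they represent a single char. In addition, Java's byte is signed, so the cast sign-extends 0xc3 and 0xa7 to the chars U+FFC3 and U+FFA7, which are halfwidth Hangul letters. That is why the output shows "ᅢᄃ" in place of "ç".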

    It is almost never correct to assume one byte is equivalent to one character.

    (Also, feel free to read the obligatory Joel blog on the subject.)
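
To make the difference concrete, here is a minimal sketch that runs both approaches on the bytes from the question and prints only string lengths and code points, so the result does not depend on the console encoding. The class name ByteVsCharDemo and the helper hexToBytes are made up for illustration; hexToBytes is only a stand-in for the question's StringUtil.toByteArray, whose real implementation is not shown.

```java
import java.nio.charset.StandardCharsets;

public class ByteVsCharDemo {

    // Hypothetical stand-in for StringUtil.toByteArray from the question:
    // parses each pair of hex digits into one byte.
    public static byte[] hexToBytes(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Decodes with an explicit charset, so the result does not depend on
    // the JVM's default charset.
    public static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Reproduces the per-byte cast from the question. Each (signed) byte is
    // sign-extended before being narrowed to a 16-bit char.
    public static String castEachByte(byte[] bytes) {
        char[] chars = new char[bytes.length];
        for (int i = 0; i < chars.length; i++) {
            chars[i] = (char) bytes[i];
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        byte[] bytes = hexToBytes("C3A76170");
        String decoded = decodeUtf8(bytes);
        String cast = castEachByte(bytes);

        System.out.println(decoded.length());                     // 3
        System.out.println(cast.length());                        // 4
        System.out.printf("U+%04X%n", (int) decoded.charAt(0));   // U+00E7
        System.out.printf("U+%04X%n", (int) cast.charAt(0));      // U+FFC3
        System.out.printf("U+%04X%n", (int) cast.charAt(1));      // U+FFA7
    }
}
```

The decoded string has three chars, with U+00E7 (ç) first, while the cast version has four chars and starts with U+FFC3 and U+FFA7. If the bytes really do arrive as UTF-8, decoding them with an explicit charset is probably the safer choice in the question's code as well.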