java unicode character-encoding

Java Unicode encoding


A Java char is 2 bytes (65,536 possible values), but there are 95,221 Unicode characters. Does this mean that a Java application can't handle certain Unicode characters?

Does this boil down to what character encoding you are using?


Solution

  • You can handle them all if you're careful enough.

    Java's char is a UTF-16 code unit, not a full Unicode character. A character whose code point is above 0xFFFF is encoded as two chars (a surrogate pair), so a String can still hold it.

    See https://www.oracle.com/technical-resources/articles/javase/supplementary.html for how to handle those characters in Java.

    (BTW, in Unicode 5.2 there are 107,154 assigned characters out of 1,114,112 possible code points.)
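
As a rough illustration of the surrogate-pair point above, here is a minimal sketch using the code-point methods on String and Character from the standard library. The class name SupplementaryDemo and the helpers codePointLength and fromCodePoint are made up for this example, and U+1F600 is just an arbitrary code point above 0xFFFF.

```java
public class SupplementaryDemo {

    // Hypothetical helper: number of Unicode code points (not chars) in s.
    public static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Hypothetical helper: builds a String from one code point.
    // A code point above 0xFFFF becomes two chars (a surrogate pair).
    public static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the range a single char can represent.
        String s = "a" + fromCodePoint(0x1F600) + "b";

        System.out.println(s.length());         // counts chars (UTF-16 code units)
        System.out.println(codePointLength(s)); // counts code points

        // Step through the string by code point so a surrogate pair is never split.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("U+%04X uses %d char(s)%n", cp, Character.charCount(cp));
            i += Character.charCount(cp);
        }
    }
}
```

The "careful enough" part is mostly this: code that indexes with charAt or relies on length() is counting code units, so anything that must treat supplementary characters as single units should go through the code-point methods instead.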