java unicode

How should I represent a single Unicode character in Java?


I would like to represent a single Unicode character in Java. Which primitive or class is appropriate for this?

Note that I want to be able to store any Unicode character, including ones that are too large for a 2-byte char.


Solution

  • char is indeed 16 bits wide; a char corresponds to a UTF-16 code unit. Characters that don't fit in a single UTF-16 code unit (most emoji, for instance) require two chars, known as a surrogate pair.

    If you need to store them individually for some reason, you can use an int for that. It has sufficient room (and then some) for every code point Unicode currently allows, U+0000 through U+10FFFF. That's what the JDK uses, for instance in Character.codePointAt(CharSequence seq, int index) and the String(int[] codePoints, int offset, int count) constructor.

    Gratuitous conversion example (live on ideone):

    String s = "😂";
    int emoji = Character.codePointAt(s, 0);
    String unumber = "U+" + Integer.toHexString(emoji).toUpperCase();
    System.out.println(s + "  is code point " + unumber);
    String s2 = Character.toString(emoji);
    System.out.println("Code point " + unumber + " converted back to string: " + s2);
    System.out.println("Successful round-trip? " + s.equals(s2));
    

    which outputs:

    😂  is code point U+1F602
    Code point U+1F602 converted back to string: 😂
    Successful round-trip? true
    

    (Character.toString(int) is "new" in Java 11 [2018]; before that, you had to use new String(new int[] { emoji }, 0, 1).)
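
To make the char versus code point distinction concrete, here is a minimal sketch that counts both for the same string and walks it one code point at a time. The class name CodePointDemo and the helpers toCodePoints and codePointToString are hypothetical names chosen for illustration; the JDK calls they wrap (String.codePoints(), String.codePointCount, Character.charCount, Character.toChars) are standard and available since Java 8, so the conversion helper is one possible alternative on runtimes older than Java 11.

```java
public class CodePointDemo {

    // Hypothetical helper: one int per Unicode code point in the string.
    public static int[] toCodePoints(String s) {
        return s.codePoints().toArray(); // String.codePoints() exists since Java 8
    }

    // Hypothetical helper: code point -> String without Character.toString(int),
    // which only exists from Java 11 onwards.
    public static String codePointToString(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F602, written as its UTF-16 surrogate pair
        // so the source file encoding does not matter.
        String s = "a\uD83D\uDE02";

        System.out.println("chars:       " + s.length());                      // UTF-16 code units
        System.out.println("code points: " + s.codePointCount(0, s.length())); // whole characters

        for (int cp : toCodePoints(s)) {
            System.out.printf("U+%04X needs %d char(s)%n", cp, Character.charCount(cp));
        }

        // Round-trip the second code point back into a String.
        int emoji = toCodePoints(s)[1];
        System.out.println("round-trip ok: " + codePointToString(emoji).equals(s.substring(1)));
    }
}
```

With the string above, length() counts three chars while codePointCount counts two code points, which is why an int (or an int[] for a sequence) is the safer container whenever a "character" may lie outside the Basic Multilingual Plane.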