java, unicode, character-encoding

Encode a codepoint


I have a Unicode codepoint, which could be anything: possibly ASCII, possibly something in the BMP, and possibly an exotic emoji such as U+1F612.

I expected there to be an easy way to take a codepoint and encode it into a byte array, but I can't find one. I can turn it into a String and then encode that, but this is a roundabout route: the codepoint is first encoded to UTF-16 and then re-encoded to the required encoding. I'd like to encode it directly to bytes.

public static byte[] encodeCodePoint(int codePoint, Charset charset) {
    // Surely there's got to be a better way than this:
    return new StringBuilder().appendCodePoint(codePoint).toString().getBytes(charset);
}

Solution

  • There is really no way to avoid UTF-16: Java represents text as UTF-16, and that is what the charset converters are designed to work with. That doesn't mean you have to use a String to hold the UTF-16 data, though:

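    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;

    public static byte[] encodeCodePoint(int codePoint, Charset charset) {
        // One char for a BMP codepoint, a surrogate pair for a supplementary one.
        char[] chars = Character.toChars(codePoint);
        CharBuffer cb = CharBuffer.wrap(chars);
        ByteBuffer buff = charset.encode(cb);
        // Copy only the encoded bytes; the buffer's backing array may be larger.
        byte[] bytes = new byte[buff.remaining()];
        buff.get(bytes);
        return bytes;
    }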
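To see how this behaves across the three cases from the question (ASCII, BMP, and a supplementary codepoint such as U+1F612), here is a minimal self-contained sketch. The class name `CodePointEncoding` and the helpers `encodeCodePointStrict` and `toHex` are hypothetical names introduced only for this illustration. One thing worth knowing: `Charset.encode` substitutes the charset's replacement bytes when a codepoint cannot be represented, so the strict variant below assumes you would rather get an exception in that case, and configures a `CharsetEncoder` with `CodingErrorAction.REPORT` to do so.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CodePointEncoding {

    // Same approach as the answer: UTF-16 chars in a CharBuffer, no String.
    // Unmappable codepoints are silently replaced by the charset's
    // replacement bytes.
    public static byte[] encodeCodePoint(int codePoint, Charset charset) {
        ByteBuffer buff = charset.encode(CharBuffer.wrap(Character.toChars(codePoint)));
        byte[] bytes = new byte[buff.remaining()];
        buff.get(bytes);
        return bytes;
    }

    // Hypothetical strict variant: throws CharacterCodingException instead
    // of substituting when the charset cannot represent the codepoint.
    public static byte[] encodeCodePointStrict(int codePoint, Charset charset)
            throws CharacterCodingException {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buff = encoder.encode(CharBuffer.wrap(Character.toChars(codePoint)));
        byte[] bytes = new byte[buff.remaining()];
        buff.get(bytes);
        return bytes;
    }

    // Small helper for printing bytes as space-separated hex.
    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0x20AC, 0x1F612}; // ASCII, BMP, supplementary
        for (int cp : samples) {
            System.out.printf("U+%04X  UTF-8: %s  UTF-16BE: %s%n", cp,
                    toHex(encodeCodePoint(cp, StandardCharsets.UTF_8)),
                    toHex(encodeCodePoint(cp, StandardCharsets.UTF_16BE)));
        }
        try {
            encodeCodePointStrict(0x1F612, StandardCharsets.US_ASCII);
        } catch (CharacterCodingException e) {
            System.out.println("US-ASCII cannot encode U+1F612: " + e);
        }
    }
}
```

Note that `Character.toChars` throws an `IllegalArgumentException` for values outside the valid codepoint range, so both methods reject invalid input before any encoding happens.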