java character-encoding codepoint charset

How to convert a code point from one charset to another in Java?


I am trying to convert codepoints from one charset to another in Java.

For example, the character ř is 248 in windows-1250 and 345 in Unicode.

So I have a source charset, a source code point, and a target charset, and I want to calculate the target code point.

This may sound easy because windows-1250 is a single-byte charset, but I want it to work with any charset, such as GB2312.

I guess it can be done somehow with the Charset class, but it seems to convert only bytes, not actual code points.

Charset sourceCharset = Charset.forName("GB2312");                
int sourceCodePoint = 45257; //吧 chinese character
Charset targetCharset = Charset.forName("UTF-8");                
int targetCodePoint = ...; //???

I checked the Charset class for code point related methods, but there are only decode and encode, which work with bytes. I tried searching for something related, but without success.

Thanks in advance for any help.


Solution

  • In Java, at least, only Unicode has a notion of code points; for any other charset there are just byte sequences. You have to convert the integer to a byte array and then decode that array to Unicode.

        Charset sourceCharset = Charset.forName("windows-1250");                
        int sourceCodePoint = 248; // ř
        byte[] bytes = {(byte)sourceCodePoint};
        String targetString = new String(bytes, sourceCharset);
        int targetCodePoint = targetString.codePointAt(0);
        System.out.println("targetString = " + targetString);
        System.out.println("targetCodePoint = " + targetCodePoint);
    

    output:

    targetString = ř
    targetCodePoint = 345
    

    Chinese characters in GB2312 are encoded as 2 bytes, so the integer has to be stored in a byte array of length 2 (this needs an import of java.nio.ByteBuffer).

        Charset sourceCharset = Charset.forName("GB2312");                
        int sourceCodePoint = 45257; // 吧 chinese character
        // ByteBuffer is big-endian by default: 45257 = 0xB0C9 becomes {0xB0, 0xC9}
        byte[] bytes = ByteBuffer.allocate(2).putShort((short)sourceCodePoint).array();
        String targetString = new String(bytes, sourceCharset);
        int targetCodePoint = targetString.codePointAt(0);
        System.out.println("targetString = " + targetString);
        System.out.println("targetCodePoint = " + targetCodePoint);
    

    output:

    targetString = 吧
    targetCodePoint = 21543
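
The two snippets above hard-code the byte length (1 or 2). A minimal sketch of a general version follows, assuming the "code" of a character in a charset is simply its encoded bytes read as one big-endian integer, that it fits in 4 bytes, and that it has no leading zero byte. The class and method names (CharsetCodes, toBytes, toUnicode, fromUnicode) are hypothetical, not part of the JDK. It uses a CharsetDecoder/CharsetEncoder set to REPORT, so invalid input raises an exception, whereas new String(bytes, charset) would silently substitute a replacement character.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names are illustrative, not a JDK API.
final class CharsetCodes {

    /** Packs a code into the fewest big-endian bytes that hold it (at least one). */
    static byte[] toBytes(int code) {
        int length = 1;
        while (length < 4 && (code >>> (8 * length)) != 0) {
            length++;
        }
        byte[] bytes = new byte[length];
        for (int i = length - 1; i >= 0; i--) {
            bytes[i] = (byte) code; // keep the lowest 8 bits
            code >>>= 8;
        }
        return bytes;
    }

    /** Decodes a charset-specific code to its Unicode code point; rejects invalid input. */
    static int toUnicode(int code, Charset source) {
        try {
            CharBuffer chars = source.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(toBytes(code)));
            if (Character.codePointCount(chars, 0, chars.length()) != 1) {
                throw new IllegalArgumentException(code + " is not a single character in " + source);
            }
            return Character.codePointAt(chars, 0);
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(code + " is not valid in " + source, e);
        }
    }

    /** Encodes a Unicode code point and packs the resulting bytes into one int. */
    static int fromUnicode(int codePoint, Charset target) {
        try {
            ByteBuffer encoded = target.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(Character.toChars(codePoint)));
            if (encoded.remaining() > 4) {
                throw new IllegalArgumentException("encoding does not fit in an int");
            }
            int code = 0;
            while (encoded.hasRemaining()) {
                code = (code << 8) | (encoded.get() & 0xFF);
            }
            return code;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(
                    "U+" + Integer.toHexString(codePoint) + " cannot be encoded in " + target, e);
        }
    }

    public static void main(String[] args) {
        int unicode = toUnicode(45257, Charset.forName("GB2312")); // 吧
        System.out.println(unicode);                                            // 21543
        System.out.println(fromUnicode(unicode, StandardCharsets.UTF_8));       // 15044775 (0xE590A7)
        System.out.println(fromUnicode(345, Charset.forName("windows-1250")));  // 248
    }
}
```

Note that for a Unicode target such as UTF-8 the code point itself does not depend on the encoding (吧 is 21543 either way); fromUnicode only matters if you want the encoded bytes as a number, or if the target is a legacy charset such as windows-1250.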