javacharacter-encodingasciiebcdic

Converting String from One Charset to Another


I am working on converting a string from one charset to another and read many example on it and finally found below code, which looks nice to me and as a newbie to Charset Encoding, I want to know, if it is the right way to do it .

public static byte[] transcodeField(byte[] source, Charset from, Charset to) {
    return new String(source, from).getBytes(to);
} 

To convert String from ASCII to EBCDIC, I have to do:

System.out.println(new String(transcodeField(ebytes,
                Charset.forName("US-ASCII"), Charset.forName("Cp1047"))));

And to convert from EBCDIC to ASCII, I have to do:

System.out.println(new String(transcodeField(ebytes,
                Charset.forName("Cp1047"), Charset.forName("US-ASCII"))));

Solution

  • The code you found (transcodeField) doesn't convert a String from one encoding to another, because a String doesn't have an encoding¹. It converts bytes from one encoding to another. The method is only useful if your use case satisfies 2 conditions:

    1. Your input data is bytes in one encoding
    2. Your output data needs to be bytes in another encoding

    In that case, it's straight forward:

    byte[] out = transcodeField(inbytes, Charset.forName(inEnc), Charset.forName(outEnc));
    

    If the input data contains characters that can't be represented in the output encoding (such as converting complex UTF8 to ASCII) those characters will be replaced with the ? replacement symbol, and the data will be corrupted.

    However a lot of people ask "How do I convert a String from one encoding to another", to which a lot of people answer with the following snippet:

    String s = new String(source.getBytes(inputEncoding), outputEncoding);

    This is complete bull****. The getBytes(String encoding) method returns a byte array with the characters encoded in the specified encoding (if possible, again invalid characters are converted to ?). The String constructor with the 2nd parameter creates a new String from a byte array, where the bytes are in the specified encoding. Now since you just used source.getBytes(inputEncoding) to get those bytes, they're not encoded in outputEncoding (except if the encodings use the same values, which is common for "normal" characters like abcd, but differs with more complex like accented characters éêäöñ).

    So what does this mean? It means that when you have a Java String, everything is great. Strings are unicode, meaning that all of your characters are safe. The problem comes when you need to convert that String to bytes, meaning that you need to decide on an encoding. Choosing a unicode compatible encoding such as UTF8, UTF16 etc. is great. It means your characters will still be safe even if your String contained all sorts of weird characters. If you choose a different encoding (with US-ASCII being the least supportive) your String must contain only the characters supported by the encoding, or it will result in corrupted bytes.

    Now finally some examples of good and bad usage.

    String myString = "Feng shui in chinese is 風水";
    byte[] bytes1 = myString.getBytes("UTF-8");  // Bytes correct
    byte[] bytes2 = myString.getBytes("US-ASCII"); // Last 2 characters are now corrupted (converted to question marks)
    
    String nordic = "Här är några merkkejä";
    byte[] bytes3 = nordic.getBytes("UTF-8");  // Bytes correct, "weird" chars take 2 bytes each
    byte[] bytes4 = nordic.getBytes("ISO-8859-1"); // Bytes correct, "weird" chars take 1 byte each
    String broken = new String(nordic.getBytes("UTF-8"), "ISO-8859-1"); // Contains now "Här är några merkkejä"
    

    The last example demonstrates that even though both of the encodings support the nordic characters, they use different bytes to represent them and using the wrong encoding when decoding results in Mojibake. Therefore there's no such thing as "converting a String from one encoding to another", and you should never use the broken example.

    Also note that you should always specify the encoding used (with both getBytes() and new String()), because you can't trust that the default encoding is always the one you want.

    As a last issue, Charset and Encoding aren't the same thing, but they're very much related.

    ¹ Technically the way a String is stored internally in the JVM is in UTF-16 encoding up to Java 8, and variable encoding from Java 9 onwards, but the developer doesn't need to care about that.


    NOTE

    It's possible to have a corrupted String and be able to uncorrupt it by fiddling with the encoding, which may be where this "convert String to other encoding" misunderstanding originates from.

    // Input comes from network/file/other place and we have misconfigured the encoding 
    String input = "Här är några merkkejä"; // UTF-8 bytes, interpreted wrongly as ISO-8859-1 compatible
    byte[] bytes = input.getBytes("ISO-8859-1"); // Get each char as single byte
    String asUtf8 = new String(bytes, "UTF-8"); // Recreate String as UTF-8
    

    If no characters were corrupted in input, the string would now be "fixed". However the proper approach is to use the correct encoding when reading input, not fix it afterwards. Especially if there's a chance of it becoming corrupted.