
ByteBuffer, CharBuffer, String and Charset


I'm trying to sort out how characters are represented as byte sequences under different character sets, and how to convert from one character set to another in Java. I'm having some difficulties.

For instance,

ByteBuffer bybf = ByteBuffer.wrap("Olé".getBytes());

My understanding is that:

Thus in this code:

Charset utf16 = Charset.forName("UTF-16");  
CharBuffer chbf = utf16.decode(bybf);  
System.out.println(chbf);  

decode() should

In fact, no byte should be altered, since everything is stored as UTF-16 and the UTF-16 Charset should act as a kind of "neutral operator". However, the result is printed as:

??

How can that be?

Additional question: To convert correctly, it seems that Charset.decode(ByteBuffer bb) requires bb to hold the UTF-16 big-endian byte sequence of a string. Is that correct?


Edit: Following the answers provided, I ran some tests that print the content of a ByteBuffer and the chars obtained by decoding it. In each group, the first line shows the bytes produced by "Olé".getBytes(charsetName), and the following line(s) show the strings obtained by decoding those bytes with Charset#decode(ByteBuffer) using various charsets.

I also confirmed that the default encoding used by getBytes() on my Windows 7 computer is windows-1252. The default does not change with the string's content: characters that windows-1252 cannot represent are replaced with '?'.

Default VM encoding: windows-1252  
Sample string: "Olé"  


  getBytes() no CS provided : 79 108 233  <-- default (windows-1252), 1 byte per char
     Decoded as windows-1252: Olé         <-- using the same CS as getBytes()
           Decoded as UTF-16: ??          <-- using another CS (does not work)

  getBytes with windows-1252: 79 108 233  <-- same as getBytes()
     Decoded as windows-1252: Olé

         getBytes with UTF-8: 79 108 195 169  <-- 'é' in UTF-8 uses 2 bytes
            Decoded as UTF-8: Olé

        getBytes with UTF-16: 254 255 0 79 0 108 0 233 <-- each char uses 2 bytes with UTF-16
           Decoded as UTF-16: Olé                          (254 255 is the byte order mark)
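
A minimal sketch of how a table like the one above could be produced. The class and method names (EncodingTable, unsigned, decode) are made up for illustration, and the sketch assumes the JVM provides the windows-1252 charset, which is not one of the charsets every Java platform is required to support.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original test code.
final class EncodingTable {
    // Formats bytes as unsigned decimal values separated by spaces.
    static String unsigned(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(b & 0xFF); // Java bytes are signed; mask to get 0..255
        }
        return sb.toString();
    }

    // Decodes the given bytes with the given charset.
    static String decode(byte[] bytes, Charset cs) {
        return cs.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // "Olé", written with an escape so the source file encoding does not matter.
        String sample = "Ol\u00e9";
        Charset cp1252 = Charset.forName("windows-1252"); // assumed to be available
        Charset[] charsets = { cp1252, StandardCharsets.UTF_8, StandardCharsets.UTF_16 };
        for (Charset cs : charsets) {
            byte[] bytes = sample.getBytes(cs); // charset given explicitly
            System.out.println("getBytes with " + cs.name() + ": " + unsigned(bytes));
            System.out.println("  decoded as " + cs.name() + ": " + decode(bytes, cs));
        }
        // Little-endian UTF-16 input that starts with its byte order mark (255 254).
        byte[] littleEndian = { (byte) 0xFF, (byte) 0xFE, 0x4F, 0, 0x6C, 0, (byte) 0xE9, 0 };
        System.out.println("LE with BOM decoded as UTF-16: "
                + decode(littleEndian, StandardCharsets.UTF_16));
    }
}
```

The last lines relate to the additional question: when decoding, Java's UTF-16 charset reads a leading byte order mark if there is one and falls back to big-endian only when none is present, so the input does not have to be big-endian.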

Solution

  • You are mostly correct.

    The native character representation in Java is UTF-16. However, when converting characters to bytes, you either specify the charset you are using or the system uses its default, which has usually been UTF-8 whenever I checked. This will yield interesting results if you mix and match.

    For example, on my system the following

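    // Print the platform default charset, which getBytes() uses when none is given.
    System.out.println(Charset.defaultCharset().name());
    // Encode with the default charset, then decode as UTF-16: a mismatch.
    ByteBuffer bybf = ByteBuffer.wrap("Olé".getBytes());
    Charset utf16 = Charset.forName("UTF-16");
    CharBuffer chbf = utf16.decode(bybf);
    System.out.println(chbf);
    // Encode and decode with the same charset: the round trip is lossless.
    bybf = ByteBuffer.wrap("Olé".getBytes(utf16));
    chbf = utf16.decode(bybf);
    System.out.println(chbf);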
    

    produces

    UTF-8
    佬쎩
    Olé

    So this part of your understanding is only correct if UTF-16 is the default charset:
    "getBytes() result is this same UTF-16 byte sequence."

    So either always specify the charset you are using, which is safest because you will always know what is going on, or always use the default for both encoding and decoding.
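
To see where the garbled results come from, here is a minimal sketch that decodes mismatched bytes as UTF-16 and prints the resulting code units in hex. The class and method names (MismatchDemo, decodeAsUtf16Hex) are hypothetical, and ISO-8859-1 stands in for windows-1252 here, on the assumption that only 'é' matters (it is byte 233 in both).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from the question or the answer.
final class MismatchDemo {
    // Decodes raw bytes as UTF-16 and returns the code units as hex, space separated.
    static String decodeAsUtf16Hex(byte[] bytes) {
        CharSequence chars = StandardCharsets.UTF_16.decode(ByteBuffer.wrap(bytes));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chars.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) chars.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "Ol\u00e9"; // "Olé"

        // UTF-8 gives 4 bytes (4F 6C C3 A9). With no byte order mark they are
        // paired as big-endian UTF-16: 4F6C and C3A9, two unrelated characters.
        System.out.println(decodeAsUtf16Hex(sample.getBytes(StandardCharsets.UTF_8)));

        // A single-byte charset gives 3 bytes (4F 6C E9). The first pair becomes
        // 4F6C; the leftover byte is malformed input and is replaced by U+FFFD.
        System.out.println(decodeAsUtf16Hex(sample.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

The two characters in the second case would explain the "??" in the question, assuming a windows-1252 console that can display neither of them.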