
How many bytes does a `char` occupy in Java?


In Java 8, a String is stored as a char[]. So if I write String test = "a"; I assume that a is one element of that char[]. Since a char occupies 2 bytes in Java, I expected test.getBytes().length to be 2, but it is 1.

String test = "a";
System.out.println(test.getBytes().length);
char c = 'c';
System.out.println(charToByte(c).length);

The result is:

1
2

A letter occupies 1 byte, as we know, but a is stored as one element of a char[], and a char occupies 2 bytes. So where is my misunderstanding?


Solution

  • The basics on String

    String holds text as Unicode, and hence can combine Greek, Arabic and Korean in a single String.

    The type char is 2 bytes wide and holds one code unit of the Unicode transfer format UTF-16. Most characters, symbols and other Unicode code points fit in 1 char, but some need a pair of chars (a surrogate pair).
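
    The difference between chars and code points can be checked directly. The following is a minimal sketch; the class and method names (CharUnits, charCount, codePointCount) are made up for illustration, and the emoji U+1F600 is just one example of a code point that needs two chars.

```java
class CharUnits {
    // Number of UTF-16 code units (Java chars) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String letter = "a";
        // U+1F600, written as an explicit surrogate pair so the source file encoding does not matter.
        String emoji = "\uD83D\uDE00";

        System.out.println(charCount(letter) + " char, " + codePointCount(letter) + " code point");
        System.out.println(charCount(emoji) + " chars, " + codePointCount(emoji) + " code point");
        System.out.println(Character.BYTES + " bytes per char");
    }
}
```

    Here length() counts chars, not code points, which is why the emoji reports 2 chars for a single code point.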

    The conversion between text (String) and binary data (byte[])

    Binary data that represents text is always encoded in some Charset, so going from String to byte[] or back always involves a conversion.

    Charset charset = Charset.defaultCharset();
    byte[] b = s.getBytes(charset);          // encode: String -> bytes
    String decoded = new String(b, charset); // decode: bytes -> String
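
    This conversion is what explains the result in the question: getBytes() without an argument encodes with the default charset, and in charsets such as UTF-8 the letter a takes a single byte. A small sketch, assuming nothing beyond the standard library (the class and method names are illustrative only):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLength {
    // Length of the string after encoding it with the given charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    // Encode and decode again with the same charset.
    static String roundTrip(String s, Charset charset) {
        return new String(s.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String test = "a";
        System.out.println(encodedLength(test, StandardCharsets.UTF_8));    // 1
        System.out.println(encodedLength(test, StandardCharsets.UTF_16BE)); // 2
        // Plain UTF_16 prepends a 2-byte byte order mark when encoding.
        System.out.println(encodedLength(test, StandardCharsets.UTF_16));   // 4
        System.out.println(roundTrip(test, StandardCharsets.UTF_8).equals(test)); // true
    }
}
```

    Passing the charset explicitly, rather than relying on the platform default, keeps the result the same on every machine.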
    

    The number of bytes a String occupies

    The string "ruĝa" contains 4 code points (symbols, glyphs). As a char array it is stored in memory as 4 chars of 2 bytes each = 8 bytes (plus a small amount of object overhead).

    It can be stored in binary data for some charsets:
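
    As an illustration, the sizes for a few of the charsets in StandardCharsets can be measured with a short sketch like the one below. The choice of charsets and the names RugaSizes and sizeIn are assumptions made for this example, not part of the original answer.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RugaSizes {
    // Number of bytes the string takes when encoded in the given charset.
    static int sizeIn(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // "ruĝa", with ĝ written as \u011D so the source file encoding does not matter.
        String s = "ru\u011Da";

        System.out.println("UTF-8:      " + sizeIn(s, StandardCharsets.UTF_8));    // 5, ĝ takes 2 bytes
        System.out.println("UTF-16BE:   " + sizeIn(s, StandardCharsets.UTF_16BE)); // 8, 2 bytes per char
        // ISO-8859-1 cannot represent ĝ; getBytes silently substitutes '?'.
        System.out.println("ISO-8859-1: " + sizeIn(s, StandardCharsets.ISO_8859_1)); // 4
        System.out.println(new String(s.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.ISO_8859_1)); // ru?a
    }
}
```

    Note that the smallest result here is also a lossy one: a charset that cannot represent a character replaces it instead of failing.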

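    However, since Java 9 (compact strings) a String may internally use a byte array instead of a char array, together with a flag for the encoding, to save memory. This only applies when every character of the content fits in a single byte (Latin-1); otherwise the content is stored as UTF-16 with 2 bytes per char. It is an implementation detail, so you should not count on it, say for dynamically built strings.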

    Answer

    public static int bytesInMemory(String s) {
        // UTF_16BE writes no byte order mark; plain UTF_16 would add 2 bytes for it.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }
    

    Most code points (symbols) take 2 bytes; some take 4 bytes each.

    Also note that é might be 2 or 4 bytes: it can be one code point, or two code points (the basic letter e followed by a zero-width combining accent). Vietnamese letters can even carry two accents, giving 3 code points.
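
    The two forms of é can be produced with java.text.Normalizer. The sketch below is only an illustration: the class and method names are made up, and the Vietnamese letter ệ (U+1EC7) is chosen here as an example of a letter with two accents.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class AccentForms {
    // Composed form (NFC): base letter and accents merged into one code point where possible.
    static String composed(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Decomposed form (NFD): base letter followed by separate combining accents.
    static String decomposed(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Size of the text as UTF-16 without a byte order mark.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";      // é as a single code point
        String vietnamese = "\u1EC7";  // ệ: e with circumflex and dot below

        System.out.println(utf16Bytes(composed(eAcute)));       // 2
        System.out.println(utf16Bytes(decomposed(eAcute)));     // 4
        System.out.println(decomposed(vietnamese).length());    // 3
    }
}
```

    Both forms render the same on screen, so normalizing to one form before comparing or measuring text avoids surprises.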