When I use Java 8, a String is stored as a char[], so if I write
String test = "a";
I think a is one element of that char[]. As we know, a char occupies 2 bytes in Java, so I expected test.getBytes().length to be 2, but it is 1:
String test = "a";
System.out.println(test.getBytes().length);
char c = 'c';
System.out.println(charToByte(c).length);
A letter occupies 1 byte, as we know, but a is stored as one element of the char[], and a char occupies 2 bytes. So I wonder, where did I misunderstand?
String holds text as Unicode, and hence can combine Greek, Arabic and Korean in a single String.
The type char holds 2 bytes, one code unit of the Unicode transformation format UTF-16. Many characters, symbols, Unicode code points will fit in 1 char, but sometimes a pair of chars is needed.
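To see the difference between chars and code points, here is a minimal sketch. The class and method names (CodePointDemo, charCount, codePointCount) are made up for illustration; length() and codePointCount() are the standard String methods.

```java
public class CodePointDemo {
    /** Number of 16-bit char units the string occupies. */
    static int charCount(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String letter = "a";
        // U+1F600 lies outside the Basic Multilingual Plane,
        // so UTF-16 needs a surrogate pair (2 chars) for it.
        String emoji = new String(Character.toChars(0x1F600));

        System.out.println(charCount(letter) + " char(s), "
                + codePointCount(letter) + " code point(s)");
        System.out.println(charCount(emoji) + " char(s), "
                + codePointCount(emoji) + " code point(s)");
    }
}
```

The letter a is 1 char and 1 code point, while the emoji is 2 chars but still 1 code point.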
There is a distinction between text (String) and binary data (byte[]). The binary data is always encoded in some Charset, and there always is a conversion between them.
Charset charset = Charset.defaultCharset();
byte[] b = s.getBytes(charset);          // encode: text -> bytes
String decoded = new String(b, charset); // decode: bytes -> text
The string "ruĝa" contains 4 code points, symbols, glyphs.
In Java 8 it is stored in memory as 4 chars of 2 bytes = 8 bytes (plus a small amount of object overhead).
As binary data its size depends on the charset used to encode it.
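The sizes for a few charsets can be checked directly. This is only a sketch: the class and helper names (EncodedSizeDemo, encodedLength, survivesRoundTrip) are made up, and ISO-8859-1 is included just as an example of a charset that cannot represent ĝ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedSizeDemo {
    /** Number of bytes the string takes when encoded in the given charset. */
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    /** True if encoding and decoding again gives back the same text. */
    static boolean survivesRoundTrip(String s, Charset charset) {
        return new String(s.getBytes(charset), charset).equals(s);
    }

    public static void main(String[] args) {
        String word = "ru\u011Da"; // "ruĝa", escaped to avoid source-encoding issues
        Charset[] charsets = {
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16BE,
            StandardCharsets.UTF_16,     // writes a 2-byte byte-order mark first
            StandardCharsets.ISO_8859_1  // cannot represent ĝ, replaces it with '?'
        };
        for (Charset cs : charsets) {
            System.out.println(cs.name() + ": " + encodedLength(word, cs)
                    + " bytes, lossless=" + survivesRoundTrip(word, cs));
        }
    }
}
```

UTF-8 needs 5 bytes (ĝ takes 2), UTF-16BE needs 8, UTF-16 needs 10 because of the byte-order mark, and ISO-8859-1 gives 4 bytes but loses the ĝ. The same helper returns 1 for "a" in UTF-8, which is why getBytes() in the question can return 1 even though a char is 2 bytes in memory.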
However, since Java 9 a String may internally use a byte array with an encoding flag (Latin-1 or UTF-16) instead of a char array, so it can save on memory. That only happens when every character of the content fits in a single byte (Latin-1). You should not count on this, say for dynamic strings.
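import java.nio.charset.StandardCharsets;
public static int bytesInMemory(String s) {
    // UTF_16BE writes no byte-order mark; plain UTF_16 would add 2 extra bytes.
    return s.getBytes(StandardCharsets.UTF_16BE).length;
}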
Most code points (symbols) take 2 bytes, some take 4 bytes each.
And note that é might be 2 or 4 bytes: one code point, or two code points (the basic letter e followed by a zero-width combining accent). Vietnamese can even have two accents per letter, so 3 code points.
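The composed and decomposed forms can be compared with java.text.Normalizer. A minimal sketch, where the class and helper names (AccentDemo, compose, decompose, utf16Bytes) are made up, and utf16Bytes simply assumes 2 bytes per char as in the Java 8 char[] layout:

```java
import java.text.Normalizer;

public class AccentDemo {
    /** Composed form (NFC): an accented letter becomes one code point where possible. */
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Decomposed form (NFD): base letter followed by combining accents. */
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Size as UTF-16 code units, assuming 2 bytes per char. */
    static int utf16Bytes(String s) {
        return s.length() * 2;
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";     // é as a single code point
        String vietnamese = "\u1EC7"; // ệ: e with circumflex and dot below

        System.out.println(utf16Bytes(compose(eAcute)) + " vs "
                + utf16Bytes(decompose(eAcute)) + " bytes");
        System.out.println(decompose(vietnamese).length()
                + " code points after decomposition");
    }
}
```

Composed é is 2 bytes and decomposed é is 4 bytes, and the Vietnamese letter decomposes into 3 code points. Both forms display the same, so compare strings only after normalizing them to the same form.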