I stumbled across some weird behaviour when encoding and decoding a string. Have a look at this example:
@Test
public void testEncoding() {
    String str = "\uDD71"; // {56689}
    byte[] utf16 = str.getBytes(StandardCharsets.UTF_16); // {-2, -1, -1, -3}
    String utf16String = new String(utf16, StandardCharsets.UTF_16); // {65533}
    assertEquals(str, utf16String);
}
I would have assumed this test would pass, but it does not. Could someone explain why the encoded and then decoded string is not equal to the original one?
U+DD71 is not a valid code point on its own: the range U+D800..U+DFFF is reserved by Unicode for the UTF-16 surrogate mechanism, so isolated code points in this range should never appear as valid character data. From the Unicode standard:
Isolated surrogate code points have no interpretation; consequently, no character code charts or names lists are provided for this range.
This works, though:
@Test
public void testEncoding() {
    String str = "\u0040";
    byte[] utf16 = str.getBytes(StandardCharsets.UTF_16);
    String utf16String = new String(utf16, StandardCharsets.UTF_16);
    assertEquals(str, utf16String);
}
So it's not your code that is at fault; the problem is that you're trying to round-trip a code point that isn't valid on its own. String.getBytes always replaces input it cannot encode with the charset's default replacement, which for UTF-16 is the encoding of U+FFFD (REPLACEMENT CHARACTER). That is where the trailing {-1, -3} bytes and the decoded value 65533 in your comments come from.
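If you would rather be told about the problem than silently get U+FFFD, encode with a CharsetEncoder instead of String.getBytes: an encoder obtained from newEncoder() reports malformed input by default. A minimal sketch (the class name is mine):

import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf16Encoding {

    public static void main(String[] args) {
        CharsetEncoder encoder = StandardCharsets.UTF_16.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // already the default,
                .onUnmappableCharacter(CodingErrorAction.REPORT); // made explicit here

        try {
            encoder.encode(CharBuffer.wrap("\uDD71"));
            System.out.println("encoded without error");
        } catch (CharacterCodingException e) {
            // The lone surrogate is rejected instead of being replaced by U+FFFD.
            System.out.println("rejected: " + e);
        }
    }
}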