So I know about String#codePointAt(int), but it's indexed by the char offset, not by the codepoint offset.
I'm thinking about trying something like:
- Use String#charAt(int) to get the char at an index.
- Test whether the char is in the high-surrogates range.
  - If so, use String#codePointAt(int) to get the codepoint, and increment the index by 2.
  - If not, use the char value as the codepoint, and increment the index by 1.

But my concerns are:
- I'm not sure whether such a codepoint is stored as two char values or one.

Yes, Java uses a UTF-16-esque encoding for internal representations of Strings, and, yes, it encodes characters outside the Basic Multilingual Plane (BMP) as surrogate pairs, that is, as two char values each.
If you know you'll be dealing with characters outside the BMP, then here is the canonical way to iterate over the characters of a Java String:
final int length = s.length();
for (int offset = 0; offset < length; ) {
    final int codepoint = s.codePointAt(offset);
    // do something with the codepoint
    offset += Character.charCount(codepoint);
}
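
For anyone who wants to try this out, here is a minimal, self-contained sketch that wraps the loop above in a helper and runs it on a string containing one character outside the BMP. The class name CodePointIteration and the helper codePointsOf are made up for illustration; they are not part of the JDK. It also shows String#codePoints(), which is available from Java 8 onward and should give the same sequence without manual offset bookkeeping.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointIteration {

    // Hypothetical helper: collects the code points of s using the
    // char-offset loop, advancing by 1 or 2 chars per code point.
    public static List<Integer> codePointsOf(String s) {
        List<Integer> result = new ArrayList<>();
        final int length = s.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);
            result.add(codepoint);
            // charCount is 2 for supplementary code points, 1 otherwise
            offset += Character.charCount(codepoint);
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a', then U+1F600 written as a surrogate pair, then 'b'
        String s = "a\uD83D\uDE00b";

        // Manual loop via the helper
        for (int cp : codePointsOf(s)) {
            System.out.printf("U+%04X%n", cp);
        }

        // Java 8+ alternative: an IntStream of code points
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Note that s.length() is 4 for this string while it holds only 3 code points, which is exactly why the offset has to advance by Character.charCount(codepoint) rather than by 1.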