I'm working on some code for generating random strings. The resulting string appears to contain invalid char combinations. Specifically, I find high surrogates which are not followed by a low surrogate.
Can anyone explain why this is happening? Do I have to explicitly generate a random low surrogate to follow a high surrogate? I had assumed this wasn't needed, as I was using the int variants of the Character class.
Here's the test code, which on a recent run produced the following bad pairings:
Bad pairing: d928 - d863
Bad pairing: da02 - 7bb6
Bad pairing: dbbc - d85c
Bad pairing: dbc6 - d85c
public static void main(String[] args) {
    Random r = new Random();
    StringBuilder builder = new StringBuilder();
    int count = 500;
    while (count > 0) {
        int codePoint = r.nextInt(Character.MAX_CODE_POINT + 1);
        if (!Character.isDefined(codePoint)
                || Character.getType(codePoint) == Character.PRIVATE_USE) {
            continue;
        }
        builder.appendCodePoint(codePoint);
        count--;
    }
    String result = builder.toString();
    // Test the result
    char lastChar = 0;
    for (int i = 0; i < result.length(); i++) {
        char c = result.charAt(i);
        if (Character.isHighSurrogate(lastChar) && !Character.isLowSurrogate(c)) {
            System.out.println(String.format("Bad pairing: %s - %s",
                    Integer.toHexString(lastChar), Integer.toHexString(c)));
        }
        lastChar = c;
    }
}
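It's possible to randomly generate high or low surrogates: code points in the range U+D800 to U+DFFF pass the isDefined check, and appendCodePoint appends each of them as a single char. If this results in a lone low surrogate, or a high surrogate not followed by a low surrogate, the resulting string is invalid. The solution is to simply exclude all surrogates: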
if (!Character.isDefined(codePoint)
        || (codePoint <= Character.MAX_CHAR && Character.isSurrogate((char) codePoint))
        || Character.getType(codePoint) == Character.PRIVATE_USE) {
    continue;
}
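To see why the original filter lets surrogates through, here is a minimal sketch that checks one of the values from the question's output (d928). The class name LoneSurrogateDemo and the helper appendsLoneSurrogate are made up for illustration; only the Character and StringBuilder calls are standard API.

```java
class LoneSurrogateDemo {
    // Hypothetical helper: true if appending this one code point
    // produces a string that is a single, unpaired surrogate char.
    static boolean appendsLoneSurrogate(int codePoint) {
        String s = new StringBuilder().appendCodePoint(codePoint).toString();
        return s.length() == 1 && Character.isSurrogate(s.charAt(0));
    }

    public static void main(String[] args) {
        int cp = 0xD928; // a high surrogate taken from the question's output

        // The surrogate range counts as "defined", so the original filter keeps it.
        System.out.println("isDefined: " + Character.isDefined(cp));
        System.out.println("is SURROGATE type: "
                + (Character.getType(cp) == Character.SURROGATE));

        // appendCodePoint writes a BMP value as one char, surrogate or not.
        System.out.println("lone surrogate appended: " + appendsLoneSurrogate(cp));

        // A supplementary code point is written as a proper pair instead.
        System.out.println("U+1F600 lone surrogate: " + appendsLoneSurrogate(0x1F600));
    }
}
```

So the int variants of Character do handle supplementary code points correctly; the problem is only that a random int can itself be a surrogate value.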
Alternatively, it should work to only look at the type returned from getType:
int type = Character.getType(codePoint);
if (type == Character.PRIVATE_USE ||
        type == Character.SURROGATE ||
        type == Character.UNASSIGNED)
    continue;
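Put together, the getType-based filter might look roughly like the sketch below. The class and method names (RandomStringSketch, isUsable, randomString, hasUnpairedSurrogate) are illustrative, not from the question, and the check walks the string by code point so that it also catches lone low surrogates, which the question's test loop does not report.

```java
import java.util.Random;

class RandomStringSketch {
    // Keep only code points that are assigned, not surrogates, not private use.
    static boolean isUsable(int codePoint) {
        int type = Character.getType(codePoint);
        return type != Character.UNASSIGNED
                && type != Character.SURROGATE
                && type != Character.PRIVATE_USE;
    }

    // Builds a string of `count` random usable code points.
    static String randomString(Random r, int count) {
        StringBuilder builder = new StringBuilder();
        while (count > 0) {
            int codePoint = r.nextInt(Character.MAX_CODE_POINT + 1);
            if (!isUsable(codePoint)) {
                continue;
            }
            builder.appendCodePoint(codePoint);
            count--;
        }
        return builder.toString();
    }

    // codePointAt returns the surrogate value itself when a surrogate is
    // unpaired, so any result inside the surrogate range is a bad char.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        String result = randomString(new Random(), 500);
        System.out.println("unpaired surrogates: " + hasUnpairedSurrogate(result));
    }
}
```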
(Technically, you could also allow randomly generated high surrogates and add another random low surrogate, but this would only create other random code points >= 0x10000 which might in turn be undefined or for private use.)
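For completeness, the pairing approach from the parenthetical could be sketched as follows, using Character.toCodePoint to combine a random high and a random low surrogate. The names SurrogatePairSketch and randomSupplementaryCodePoint are hypothetical. As noted, the result is just another random code point at or above 0x10000, so it still has to go through the same unassigned and private-use checks.

```java
import java.util.Random;

class SurrogatePairSketch {
    // Picks a random high and a random low surrogate and combines them.
    // The result always lies in 0x10000..0x10FFFF.
    static int randomSupplementaryCodePoint(Random r) {
        int highRange = Character.MAX_HIGH_SURROGATE - Character.MIN_HIGH_SURROGATE + 1;
        int lowRange = Character.MAX_LOW_SURROGATE - Character.MIN_LOW_SURROGATE + 1;
        char high = (char) (Character.MIN_HIGH_SURROGATE + r.nextInt(highRange));
        char low = (char) (Character.MIN_LOW_SURROGATE + r.nextInt(lowRange));
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        Random r = new Random();
        int rejected = 0;
        int total = 1000;
        for (int i = 0; i < total; i++) {
            int cp = randomSupplementaryCodePoint(r);
            int type = Character.getType(cp);
            // Many of these code points are unassigned or private use.
            if (type == Character.UNASSIGNED || type == Character.PRIVATE_USE) {
                rejected++;
            }
        }
        System.out.println(rejected + " of " + total + " would still be filtered out");
    }
}
```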