java unicode surrogate-pairs

How to generate a random Unicode string including supplementary characters?


I'm working on some code for generating random strings. The resulting string appears to contain invalid char combinations. Specifically, I find high surrogates which are not followed by a low surrogate.

Can anyone explain why this is happening? Do I have to explicitly generate a random low surrogate to follow each high surrogate? I had assumed this wasn't needed, since I was using the int (code point) variants of the Character methods.

A recent run of the test code produced the following bad pairings:

Bad pairing: d928 - d863
Bad pairing: da02 - 7bb6
Bad pairing: dbbc - d85c
Bad pairing: dbc6 - d85c

Here's the test code:

public static void main(String[] args) {
  Random r = new Random();
  StringBuilder builder = new StringBuilder();

  int count = 500;
  while (count > 0) {
    int codePoint = r.nextInt(Character.MAX_CODE_POINT + 1);

    if (!Character.isDefined(codePoint)
        || Character.getType(codePoint) == Character.PRIVATE_USE) {
      continue;
    }

    builder.appendCodePoint(codePoint);
    count--;
  }

  String result = builder.toString();

  // Test the result
  char lastChar = 0;
  for (int i = 0; i < result.length(); i++) {
    char c = result.charAt(i);
    if (Character.isHighSurrogate(lastChar) && !Character.isLowSurrogate(c)) {
      System.out.println(String.format("Bad pairing: %s - %s",
          Integer.toHexString(lastChar), Integer.toHexString(c)));
    }
    lastChar = c;
  }
}

Solution

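  • The random range 0 to Character.MAX_CODE_POINT includes the surrogate code points U+D800 to U+DFFF. These pass your filter, because Character.isDefined returns true for them (their type is SURROGATE, not UNASSIGNED), and appendCodePoint then appends each one as a single char. A lone low surrogate, or a high surrogate not followed by a low surrogate, makes the resulting string invalid. The solution is to simply exclude all surrogates: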

    if (!Character.isDefined(codePoint)
        || (codePoint <= Character.MAX_VALUE && Character.isSurrogate((char) codePoint))
        || Character.getType(codePoint) == Character.PRIVATE_USE) {
      continue;
    }
    

    Alternatively, it should work to look only at the type returned from getType, since a code point that is not defined has the type UNASSIGNED:

    int type = Character.getType(codePoint);
    if (type == Character.PRIVATE_USE ||
        type == Character.SURROGATE ||
        type == Character.UNASSIGNED)
        continue;
    

    (Technically, you could also allow randomly generated high surrogates and follow each with a random low surrogate, but this would only produce another random code point >= 0x10000, which might in turn be undefined or for private use and would have to be filtered again.)
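
Putting the pieces together, here is a minimal sketch of the generator with the getType-based filter, plus a check that walks the result by code point instead of by char. The class name RandomUnicodeString and the helpers isAllowed, randomString and hasUnpairedSurrogate are made up for this illustration and are not part of the original code; only the Character, Random, String and StringBuilder calls are standard JDK API.

```java
import java.util.Random;

public class RandomUnicodeString {

    // Hypothetical helper: rejects unassigned, surrogate and private-use code points.
    static boolean isAllowed(int codePoint) {
        int type = Character.getType(codePoint);
        return type != Character.UNASSIGNED
            && type != Character.SURROGATE
            && type != Character.PRIVATE_USE;
    }

    // Hypothetical helper: builds a string of exactly `count` allowed code points.
    static String randomString(Random r, int count) {
        StringBuilder builder = new StringBuilder();
        while (count > 0) {
            int codePoint = r.nextInt(Character.MAX_CODE_POINT + 1);
            if (!isAllowed(codePoint)) {
                continue; // try again without consuming one of the slots
            }
            // Supplementary code points are appended as a valid surrogate pair.
            builder.appendCodePoint(codePoint);
            count--;
        }
        return builder.toString();
    }

    // Hypothetical helper: true if the string contains a surrogate that is
    // not part of a valid high/low pair.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); ) {
            // codePointAt combines a valid pair into one supplementary code
            // point and returns a lone surrogate unchanged.
            int cp = s.codePointAt(i);
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        String result = randomString(new Random(), 500);
        System.out.println(result.codePointCount(0, result.length())); // prints 500
        System.out.println(hasUnpairedSurrogate(result));              // prints false
    }
}
```

Note that result.length() will usually be larger than 500, because every supplementary code point occupies two chars. Which code points count as assigned depends on the Unicode version your JDK ships with, so the exact set of characters that can appear varies between Java releases.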