java unicode codepoint surrogate-pairs supplementary

How to read non-BMP (astral) Unicode supplementary characters (code points)


The G clef (U+1D11E) is not part of the Basic Multilingual Plane (BMP), which means that it requires more than 16 bits. Almost all of Java's read functions return only a char, or an int that also holds only 16 bits. Which function reads complete Unicode symbols, including those in the SMP, SIP, TIP, SSP and PUA?

Update

I am asking how to read a single Unicode symbol (or code point) from an input stream. I do not have an integer array, nor do I want to read a line.

It is possible to build a code point with Character.toCodePoint(), but this function requires a char. On the other hand, reading a char directly is not possible because read() returns an int. My best workaround so far is this, but it still contains unsafe casts:

public int read_code_point (Reader input) throws java.io.IOException
{
  int ch16 = input.read();
  if (Character.isHighSurrogate((char)ch16))
    return Character.toCodePoint((char)ch16, (char)input.read());
  else 
    return (int)ch16;
}

How can this be done better?

Update 2

Another version returning a String but still using casts:

public String readchar (Reader input) throws java.io.IOException
{
  int i16 = input.read(); // UTF-16 as int
  if (i16 == -1) return null;
  char c16 = (char)i16; // UTF-16
  if (Character.isHighSurrogate(c16)) {
    int low_i16 = input.read(); // low surrogate UTF-16 as int
    if (low_i16 == -1)
      throw new java.io.IOException ("Can not read low surrogate");
    char low_c16 = (char)low_i16;
    int codepoint = Character.toCodePoint(c16, low_c16);
    return new String (Character.toChars(codepoint));
  }
  else 
    return Character.toString(c16);
}

The remaining question: are the casts safe, and if not, how can they be avoided?


Solution

  • My best workaround so far is this, but it still contains unsafe casts

    The only unsafe thing about the code you've presented is that ch16 might be -1 if input has reached EOF. If you check for this condition first, the other (char) casts are guaranteed to be safe, because Reader.read() is specified to return either -1 or a value within the range of char (0 to 0xFFFF).

    public int read_code_point (Reader input) throws java.io.IOException
    {
      int ch16 = input.read();
      if (ch16 < 0 || !Character.isHighSurrogate((char)ch16))
        return ch16;
      else {
        int loSurr = input.read();
        if(loSurr < 0 || !Character.isLowSurrogate((char)loSurr)) 
          return ch16; // or possibly throw an exception
        else 
          return Character.toCodePoint((char)ch16, (char)loSurr);
      }
    }
    

    This still isn't ideal. You really need to handle the edge case where the first char read is a high surrogate but the second one isn't a matching low surrogate. In that case you probably want to return the first char as-is and back up the reader so that the next read gives you the next character. That only works if input.markSupported() == true. If you can guarantee that, then how about:

    public int read_code_point (Reader input) throws java.io.IOException
    {
      int firstChar = input.read();
      if (firstChar < 0 || !Character.isHighSurrogate((char)firstChar)) {
        return firstChar;
      } else {
        input.mark(1);
        int secondChar = input.read();
        if(secondChar < 0) {
          // reached EOF
          return firstChar;
        } else if(!Character.isLowSurrogate((char)secondChar)) {
          // unpaired surrogates, un-read the second char
          input.reset();
          return firstChar;
        }
        else {
          return Character.toCodePoint((char)firstChar, (char)secondChar);
        }
      }
    }
    

    Or you could wrap the original reader in a PushbackReader and use unread(secondChar) instead of mark() and reset().
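
A minimal sketch of the PushbackReader variant, which does not depend on markSupported(). The class name CodePointReader and the method name readCodePoint are made up for illustration. Returning an unpaired surrogate as-is is just one possible policy; throwing an exception would be another.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper class, not part of the JDK.
public class CodePointReader {

    /**
     * Reads one code point, or returns -1 at EOF.
     * An unpaired surrogate is returned unchanged (an assumed policy).
     */
    public static int readCodePoint(PushbackReader input) throws IOException {
        int first = input.read();
        if (first < 0 || !Character.isHighSurrogate((char) first)) {
            return first; // EOF, a BMP char, or a stray low surrogate
        }
        int second = input.read();
        if (second < 0) {
            return first; // EOF right after a high surrogate
        }
        if (!Character.isLowSurrogate((char) second)) {
            input.unread(second); // push back so the next call sees it
            return first;
        }
        return Character.toCodePoint((char) first, (char) second);
    }

    public static void main(String[] args) throws IOException {
        // 'a', G clef (U+1D11E as a surrogate pair), a lone high surrogate, 'z'
        Reader source = new StringReader("a\uD834\uDD1E\uD834z");
        // The default pushback buffer holds one char, which is all that is needed here.
        PushbackReader input = new PushbackReader(source);
        int cp;
        while ((cp = readCodePoint(input)) != -1) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

The caller has to keep using the same PushbackReader for all later reads. A char that was pushed back lives in the wrapper's buffer, not in the underlying reader.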