The G-Clef (U+1D11E) is not part of the Basic Multilingual Plane (BMP), which means that it requires more than 16 bits. Almost all of Java's read functions return only a char or an int that also holds only a 16-bit value. Which function reads complete Unicode symbols, including those in the SMP, SIP, TIP, SSP and PUA?
Update
My question is how to read a single Unicode symbol (or code point) from an input stream. I do not have an integer array, nor do I want to read a whole line.
It is possible to build a code point with Character.toCodePoint(), but this function requires a char. On the other hand, reading a char directly is not possible because read() returns an int. My best workaround so far is this, but it still contains unsafe casts:
public int read_code_point (Reader input) throws java.io.IOException
{
int ch16 = input.read();
if (Character.isHighSurrogate((char)ch16))
return Character.toCodePoint((char)ch16, (char)input.read());
else
return (int)ch16;
}
How can this be done better?
Update 2
Another version returning a String but still using casts:
public String readchar (Reader input) throws java.io.IOException
{
int i16 = input.read(); // UTF-16 as int
if (i16 == -1) return null;
char c16 = (char)i16; // UTF-16
if (Character.isHighSurrogate(c16)) {
int low_i16 = input.read(); // low surrogate UTF-16 as int
if (low_i16 == -1)
throw new java.io.IOException ("Can not read low surrogate");
char low_c16 = (char)low_i16;
int codepoint = Character.toCodePoint(c16, low_c16);
return new String (Character.toChars(codepoint));
}
else
return Character.toString(c16);
}
The remaining question: are the casts safe, and if not, how can I avoid them?
My best workaround so far is this, but it still contains unsafe casts
The only unsafe thing about the code you've presented is that ch16 might be -1 if input has reached EOF. If you check for this condition first, then you can guarantee that the other (char) casts are safe, as Reader.read() is specified to return either -1 or a value that is within the range of char (0 to 0xFFFF).
public int read_code_point (Reader input) throws java.io.IOException
{
int ch16 = input.read();
if (ch16 < 0 || !Character.isHighSurrogate((char)ch16))
return ch16;
else {
int loSurr = input.read();
if(loSurr < 0 || !Character.isLowSurrogate((char)loSurr))
return ch16; // or possibly throw an exception
else
return Character.toCodePoint((char)ch16, (char)loSurr);
}
}
This still isn't ideal. Really you need to handle the edge case where the first char read is a high surrogate but the second one isn't a matching low surrogate, in which case you probably want to return the first char as-is and back up the reader so that the next read gives you the next character. But that only works if input.markSupported() == true. If you can guarantee that, then how about:
public int read_code_point (Reader input) throws java.io.IOException
{
int firstChar = input.read();
if (firstChar < 0 || !Character.isHighSurrogate((char)firstChar)) {
return firstChar;
} else {
input.mark(1);
int secondChar = input.read();
if(secondChar < 0) {
// reached EOF
return firstChar;
} else if(!Character.isLowSurrogate((char)secondChar)) {
// unpaired surrogates, un-read the second char
input.reset();
return firstChar;
}
else {
return Character.toCodePoint((char)firstChar, (char)secondChar);
}
}
}
Or you could wrap the original reader in a PushbackReader and use unread(secondChar).
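A minimal sketch of that PushbackReader variant might look like the following. The class name CodePointReader and the method name readCodePoint are made up for illustration, and returning an unpaired surrogate as-is is just one possible policy (throwing an exception would be another). It relies only on the documented behaviour that read() returns -1 or a value in the char range, so the (char) casts are safe once the EOF check has been done.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper class, named here only for the purpose of this sketch.
final class CodePointReader {

    // Reads one code point from the reader, or returns -1 at EOF.
    // An unpaired surrogate is returned unchanged (an assumed policy).
    static int readCodePoint(PushbackReader input) throws IOException {
        int first = input.read();
        if (first < 0 || !Character.isHighSurrogate((char) first)) {
            return first; // EOF, a BMP character, or a lone low surrogate
        }
        int second = input.read();
        if (second < 0) {
            return first; // EOF directly after a high surrogate
        }
        if (!Character.isLowSurrogate((char) second)) {
            input.unread(second); // push it back so the next call sees it
            return first;
        }
        return Character.toCodePoint((char) first, (char) second);
    }

    public static void main(String[] args) throws IOException {
        // G-Clef as a surrogate pair, then 'a', a lone high surrogate, and 'b'.
        Reader source = new StringReader("\uD834\uDD1Ea\uD834b");
        // The single-argument constructor gives a one-char pushback buffer,
        // which is all this method needs.
        PushbackReader input = new PushbackReader(source);
        int cp;
        while ((cp = readCodePoint(input)) >= 0) {
            System.out.println(String.format("U+%04X", cp));
        }
    }
}
```

Unlike mark() and reset(), this does not depend on markSupported(), because the PushbackReader wrapper supplies the one-character buffer itself. The catch is that every read has to go through the same wrapper, otherwise a pushed-back char is lost.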