To convert Java characters to XML entities, I can do the following for each char in a String:
buf.append("&#x"+ Integer.toHexString(c | 0x10000).substring(1) +";");
However, according to other Stack Overflow questions, this only works for Unicode 3.0.
If I use a UTF-8 Reader to read in a String, then presumably that String contains the characters in a format that works up through Unicode 6.0 (because Java 7 supports Unicode 6.0 according to the javadoc).
Once I have that String, how can I write it out as XML entities? Ideally I'd use some API that would continue working as new versions of Unicode come out.
Either you are not using correct terminology, or there is a great deal of confusion here.
The &#x character reference notation just specifies a numeric codepoint; it is independent of the version of Unicode used by any reader or parser.
Your code is actually only compatible with Unicode 1.x, because it assumes a character's numeric value is less than 2^16. As of Unicode 2.0 that is not a correct assumption. Some characters are represented by a single Java char, while other characters are represented by two Java chars (known as surrogates).
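To make the surrogate issue concrete, here is a small sketch comparing the per-char approach from the question with a loop that walks code points instead. The class and method names are made up for illustration, and U+1F600 is just an arbitrary example of a character outside the 16-bit range.

```java
final class SurrogateDemo {
    // The question's approach: one reference per Java char (UTF-16 code unit).
    static String perChar(String s) {
        StringBuilder buf = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            buf.append("&#x")
               .append(Integer.toHexString(c | 0x10000).substring(1))
               .append(';');
        }
        return buf.toString();
    }

    // One reference per code point: a surrogate pair is combined into one value.
    static String perCodePoint(String s) {
        StringBuilder buf = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            buf.append("&#x").append(Integer.toHexString(cp)).append(';');
            i += Character.charCount(cp); // 1 for BMP characters, 2 for surrogate pairs
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // One code point that needs two Java chars.
        String s = new String(Character.toChars(0x1F600));
        System.out.println(s.length());      // 2
        System.out.println(perChar(s));      // &#xd83d;&#xde00;
        System.out.println(perCodePoint(s)); // &#x1f600;
    }
}
```

The per-char output is two references to surrogate values, which are not legal XML characters, so an XML parser should reject it. The per-code-point output is a single reference to the actual character.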
I'm not sure what a "UTF-8 Reader" is. A Reader just reads char values, and does not know about UTF-8 or any other charset, except for InputStreamReader, which uses a CharsetDecoder to translate bytes to chars using the UTF-8 encoding (or whatever encoding a particular CharsetDecoder uses).
In any event, no Reader will parse the XML &#x character reference notation. You must use an XML parser for that.
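For the reverse direction, a minimal sketch using the JDK's built-in DOM parser might look like the following. The class name, the textOf helper, and the sample document are assumptions made for the example; any conforming XML parser should resolve the reference the same way.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class CharRefParseDemo {
    // Parses a small XML document and returns the root element's text,
    // with character references already resolved by the parser.
    static String textOf(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Parser configuration, SAX, and I/O errors are wrapped for brevity.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = textOf("<r>&#x1F600;</r>");
        // The single reference becomes one code point stored as two chars.
        System.out.println(text.length());
        System.out.println(Integer.toHexString(text.codePointAt(0)));
    }
}
```

A plain Reader over the same input would simply hand back the literal characters `&`, `#`, `x`, and so on.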
No Reader or XML parser is affected by the Unicode version known to Java, because no Reader or XML parser consults a Unicode database in any way. The characters are just treated as numeric values as they are parsed. Whether they correspond to assigned codepoints in any Unicode version is never considered.
Finally, to write out a String as XML, you can use a Formatter:
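import java.util.Formatter;

static String toXML(String s) {
    Formatter formatter = new Formatter();
    int len = s.length();
    // Advance one code point at a time so a surrogate pair is treated as a single character.
    for (int i = 0; i < len; i = s.offsetByCodePoints(i, 1)) {
        int c = s.codePointAt(i);
        // Escape control characters, non-ASCII characters, and XML markup characters.
        if (c < 32 || c > 126 || c == '&' || c == '<' || c == '>') {
            formatter.format("&#x%x;", c);
        } else {
            formatter.format("%c", c);
        }
    }
    return formatter.toString();
}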
As you can see, there is no code that depends on the Unicode version, because the characters are just numeric values. Whether each numeric value is an assigned Unicode codepoint is not relevant.
(My first inclination was to use the XMLStreamWriter class, but it turns out an XMLStreamWriter that uses a non-Unicode encoding such as ISO-8859-1 or US-ASCII does not properly output surrogate pairs as single character entities, as of Java 1.8.0_05.)