Hi, I would like to remove all invalid XML characters from a string. I would like to use a regular expression with the string.replace method,
like this:
line.replace(regExp,"");
What is the right regExp to use?
An invalid XML character is anything that does not match this:
[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Thanks.
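Java's regex engine supports supplementary characters, so you can specify those high ranges either as pairs of UTF-16 surrogate chars or, even easier, with the \x{h...h} syntax (available since Java 7),
which accepts any valid code point. Inside a Java string literal the backslash has to be doubled, i.e. "\\x{10000}".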
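To illustrate that the two notations describe the same range, here is a minimal sketch. The class and field names are made up for this example, and it assumes Java 7 or later for the \x{h...h} form.

```java
import java.util.regex.Pattern;

public class SupplementaryRangeDemo {
    // Supplementary range written with code point escapes (note the doubled backslash).
    static final Pattern BY_CODE_POINT = Pattern.compile("[\\x{10000}-\\x{10FFFF}]");
    // The same range written as UTF-16 surrogate pairs embedded in the pattern string.
    static final Pattern BY_SURROGATES = Pattern.compile("[\uD800\uDC00-\uDBFF\uDFFF]");

    public static void main(String[] args) {
        // U+1F600 lies above the BMP, so it is stored as two chars in a Java String.
        String supplementary = new String(Character.toChars(0x1F600));
        System.out.println(BY_CODE_POINT.matcher(supplementary).matches());
        System.out.println(BY_SURROGATES.matcher(supplementary).matches());
        // A plain BMP letter is outside the range for both patterns.
        System.out.println(BY_CODE_POINT.matcher("A").matches());
    }
}
```

Both patterns should accept the supplementary character and reject the plain letter; the \x{h...h} form is just easier to read.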
Here is the pattern for removing characters that are illegal in XML 1.0:
// XML 1.0
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml10pattern = "[^"
+ "\u0009\r\n"
+ "\u0020-\uD7FF"
+ "\uE000-\uFFFD"
+ "\\x{10000}-\\x{10FFFF}"
+ "]";
Most people will want the XML 1.0 version.
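If you sanitize many strings, it may be worth compiling the pattern once and reusing it. The following is only a sketch of that idea: the class name XmlSanitizer and the method stripInvalidXml10 are hypothetical, and the character class is the XML 1.0 one from above written entirely with \x escapes (Java 7+).

```java
import java.util.regex.Pattern;

public class XmlSanitizer {
    // Matches runs of characters that fall outside the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern INVALID_XML10 = Pattern.compile(
            "[^\\x09\\x0A\\x0D"
            + "\\x20-\\x{D7FF}"
            + "\\x{E000}-\\x{FFFD}"
            + "\\x{10000}-\\x{10FFFF}"
            + "]+");

    // Hypothetical helper: returns the input with all illegal characters removed.
    public static String stripInvalidXml10(String input) {
        return INVALID_XML10.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // NUL and backspace are both illegal in XML 1.0.
        String illegal = "Hello,\0 World!\b";
        System.out.println(stripInvalidXml10(illegal)); // prints "Hello, World!"
    }
}
```

Tab, carriage return and line feed are kept, as are supplementary characters such as emoji.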
Here is the pattern for removing characters that are illegal in XML 1.1:
// XML 1.1
// [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml11pattern = "[^"
+ "\u0001-\uD7FF"
+ "\uE000-\uFFFD"
+ "\\x{10000}-\\x{10FFFF}"
+ "]+";
You will need to use String.replaceAll(...) and not String.replace(...), because replace treats its first argument as literal text rather than as a regular expression.
String illegal = "Hello, World!\0";
String legal = illegal.replaceAll(xml10pattern, "");
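The difference between the two methods shows up as soon as the first argument contains regex syntax. A small sketch (the class name and sample string are made up for illustration):

```java
public class ReplaceVsReplaceAll {
    static final String SAMPLE = "a1b[0-9]c";

    // replace: the first argument is matched as literal text.
    static String viaReplace(String s) {
        return s.replace("[0-9]", "");
    }

    // replaceAll: the first argument is compiled as a regular expression.
    static String viaReplaceAll(String s) {
        return s.replaceAll("[0-9]", "");
    }

    public static void main(String[] args) {
        System.out.println(viaReplace(SAMPLE));    // prints "a1bc"
        System.out.println(viaReplaceAll(SAMPLE)); // prints "ab[-]c"
    }
}
```

With replace, a character-class pattern like the ones above would only remove a literal copy of the pattern text, which is almost never what you want here.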