I'm currently writing a utility class to sanitize input that is saved to an XML document. For us, sanitizing means that all illegal characters (https://en.wikipedia.org/wiki/Valid_characters_in_XML#XML_1.0) are simply removed from the string.
I tried to do this with a regex that replaces every invalid character with an empty string, but for Unicode characters outside the BMP this seems to break the encoding somehow, leaving me with ? characters. It also does not seem to matter which way of replacing by regex I use (String#replaceAll(String, String), Pattern#compile(String), or org.apache.commons.lang3.RegExUtils#removeAll(String, String)).
Here's an example implementation with a test (in Spock) that shows the problem:
XmlStringUtil.java
package com.example.util;

import lombok.NonNull;

import java.util.regex.Pattern;

public class XmlStringUtil {

    private static final Pattern XML_10_PATTERN = Pattern.compile(
        "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]"
    );

    public static String sanitizeXml10(@NonNull String text) {
        return XML_10_PATTERN.matcher(text).replaceAll("");
    }
}
XmlStringUtilSpec.groovy
package com.example.util

import spock.lang.Specification

class XmlStringUtilSpec extends Specification {

    def 'sanitize string values for xml version 1.0'() {
        when: 'a string is sanitized'
        def sanitizedString = XmlStringUtil.sanitizeXml10 inputString

        then: 'the returned sanitized string matches the expected one'
        sanitizedString == expectedSanitizedString

        where:
        inputString                                  | expectedSanitizedString
        ''                                           | ''
        '\b'                                         | ''
        '\u0001'                                     | ''
        'Hello World!\0'                             | 'Hello World!'
        'text with emoji \uD83E\uDDD1\uD83C\uDFFB'   | 'text with emoji \uD83E\uDDD1\uD83C\uDFFB'
    }
}
I now have a workaround where I rebuild the whole string from its individual code points, but that does not feel like the correct solution.
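A code-point-based rebuild of the kind described here could look roughly like the following. This is only a minimal sketch: the class name CodePointSanitizer and the helper isValidXml10 are hypothetical names chosen for illustration, and the ranges are taken from the XML 1.0 list linked above.

```java
class CodePointSanitizer {

    // True if the code point is allowed in XML 1.0 (ranges taken from the
    // list of valid XML 1.0 characters referenced in the question).
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Rebuilds the string from its code points, dropping invalid ones.
    // String#codePoints() yields a supplementary character as one int and an
    // unpaired surrogate as its own value, which the filter then rejects.
    static String sanitizeXml10(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints()
            .filter(CodePointSanitizer::isValidXml10)
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        String emoji = "text with emoji \uD83E\uDDD1\uD83C\uDFFB";
        System.out.println(sanitizeXml10(emoji).equals(emoji));
        System.out.println(sanitizeXml10("Hello World!\0"));
    }
}
```

Because this works on whole code points and never on individual char values, a surrogate pair is either kept or dropped as a unit.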
Thanks in advance!
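After some reading and experimenting, a slight change to the regex works: replacing the \x{..} range with the corresponding surrogate pairs (\u...\u...). Note that these escapes use a single backslash, so the Java compiler resolves them and the pattern string contains the actual supplementary characters instead of a regex escape:
private static final Pattern XML_10_PATTERN = Pattern.compile(
    "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\uD800\uDC00-\uDBFF\uDFFF]"
);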
Check:
sanitizeXml10("\uD83E\uDDD1\uD83C\uDFFB").codePoints().mapToObj(Integer::toHexString).forEach(System.out::println);
results in
1f9d1
1f3fb
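The cases from the Spock table can also be rerun in plain Java against the revised pattern. This is a minimal sketch, not part of the original utility: the class name SurrogatePatternCheck is hypothetical, and the pattern is copied from the snippet above.

```java
import java.util.regex.Pattern;

class SurrogatePatternCheck {

    // Same character class as above: the supplementary range is written as
    // literal surrogate pairs that the compiler places into the string.
    static final Pattern XML_10_PATTERN = Pattern.compile(
        "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\uD800\uDC00-\uDBFF\uDFFF]"
    );

    static String sanitizeXml10(String text) {
        return XML_10_PATTERN.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String emoji = "text with emoji \uD83E\uDDD1\uD83C\uDFFB";
        // Each line prints whether the sanitized value equals the expected one.
        System.out.println(sanitizeXml10("").isEmpty());
        System.out.println(sanitizeXml10("\b").isEmpty());
        System.out.println(sanitizeXml10("\u0001").isEmpty());
        System.out.println(sanitizeXml10("Hello World!\0").equals("Hello World!"));
        System.out.println(sanitizeXml10(emoji).equals(emoji));
    }
}
```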