java regex xml supplementary

Java RegEx matcher breaks characters outside the BMP


I'm currently writing a utility class to sanitize input that is saved to an XML document. For us, sanitizing means that all illegal characters (https://en.wikipedia.org/wiki/Valid_characters_in_XML#XML_1.0) are simply removed from the string.

I tried to do this with a regex that replaces all invalid characters with an empty string, but for Unicode characters outside the BMP this seems to break the encoding somehow, leaving me with those ? characters. It also does not seem to matter which way of replacing by regex I use (String#replaceAll(String, String), Pattern#compile(String) with Matcher#replaceAll(String), org.apache.commons.lang3.RegExUtils#removeAll(String, String)).
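
For context, a minimal sketch of how Java represents a character outside the BMP, and of one way a ? can appear when only half of it survives. The class name SurrogatePairDemo and its helper methods are made up for illustration; whether this is what actually happens in the code above is an assumption, not something the example proves.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original code.
class SurrogatePairDemo {

    // U+1F9D1 written as a UTF-16 surrogate pair (high surrogate, low surrogate).
    static final String EMOJI = "\uD83E\uDDD1";

    // Reports the UTF-16 length, the code point count and the first code point.
    static String describe(String s) {
        return "chars=" + s.length()
            + " codePoints=" + s.codePointCount(0, s.length())
            + " first=U+" + Integer.toHexString(s.codePointAt(0)).toUpperCase();
    }

    // Encodes to UTF-8 and decodes again; a lone surrogate cannot be encoded
    // and is replaced by the encoder's replacement byte.
    static String roundTripUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(describe(EMOJI)); // prints chars=2 codePoints=1 first=U+1F9D1

        // Keeping only the high surrogate leaves an unpaired code unit behind.
        String broken = EMOJI.substring(0, 1);
        System.out.println(roundTripUtf8(broken)); // prints ?
    }
}
```

So if a replacement ever removes just one of the two code units, the remaining half is no longer a valid character once the string is written out.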

Here's an example implementation with a test (in Spock) that shows the problem:

XmlStringUtil.java

package com.example.util;

import lombok.NonNull;

import java.util.regex.Pattern;

public class XmlStringUtil {

    private static final Pattern XML_10_PATTERN = Pattern.compile(
        "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]"
    );

    public static String sanitizeXml10(@NonNull String text) {
        return XML_10_PATTERN.matcher(text).replaceAll("");
    }

}

XmlStringUtilSpec.groovy

package com.example.util

import spock.lang.Specification

class XmlStringUtilSpec extends Specification {

    def 'sanitize string values for xml version 1.0'() {
        when: 'a string is sanitized'
            def sanitizedString = XmlStringUtil.sanitizeXml10 inputString

        then: 'the returned sanitized string matches the expected one'
            sanitizedString == expectedSanitizedString

        where:
            inputString                                | expectedSanitizedString
            ''                                         | ''
            '\b'                                       | ''
            '\u0001'                                   | ''
            'Hello World!\0'                           | 'Hello World!'
            'text with emoji \uD83E\uDDD1\uD83C\uDFFB' | 'text with emoji \uD83E\uDDD1\uD83C\uDFFB'
    }

}

I now have a workaround where I rebuild the whole string from its individual code points, but that does not feel like the right solution.
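
For reference, the code point approach looks roughly like the sketch below. This is a simplified reconstruction rather than the exact code; the class name XmlCodePointFilter and the helper isXml10Char are placeholders, and the ranges are the same ones used in the regex above.

```java
// Hypothetical reconstruction of the code point based workaround.
class XmlCodePointFilter {

    // True if the code point falls into one of the XML 1.0 ranges
    // that the regex above whitelists.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Walks the string code point by code point (not char by char),
    // so a surrogate pair is always kept or dropped as a whole.
    static String sanitizeXml10(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .filter(XmlCodePointFilter::isXml10Char)
            .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitizeXml10("Hello World!\0")); // prints Hello World!
    }
}
```

An unpaired surrogate shows up in codePoints() as a value between 0xD800 and 0xDFFF, which none of the ranges cover, so it is removed as well.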

Thanks in advance!


Solution

  • After some reading and experimenting, a slight change to the regex works: replacing the \x{..} escapes with the surrogate pairs \u...\u...:

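    // The single-backslash escapes at the end are resolved by the Java compiler,
    // so the regex receives the literal code points U+10000 and U+10FFFF.
    private static final Pattern XML_10_PATTERN = Pattern.compile(
        "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\uD800\uDC00-\uDBFF\uDFFF]"
    );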
    

    Check:

    sanitizeXml10("\uD83E\uDDD1\uD83C\uDFFB").codePoints().mapToObj(Integer::toHexString).forEach(System.out::println);
    

    results in

    1f9d1
    1f3fb
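
To run the check against all rows of the Spock table from the question, a self-contained sketch could look like this. The class name XmlSurrogatePatternCheck is made up for the example; the pattern is the one shown above, and the inputs are the Groovy test rows rewritten as Java string literals.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check, not part of the original util class.
class XmlSurrogatePatternCheck {

    // Same pattern as in the solution: the supplementary range is given as
    // literal surrogate pairs that the Java compiler resolves in the string.
    static final Pattern XML_10_PATTERN = Pattern.compile(
        "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\uD800\uDC00-\uDBFF\uDFFF]"
    );

    static String sanitizeXml10(String text) {
        return XML_10_PATTERN.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String[] inputs = {
            "", "\b", "\u0001", "Hello World!\0",
            "text with emoji \uD83E\uDDD1\uD83C\uDFFB"
        };
        for (String input : inputs) {
            String sanitized = sanitizeXml10(input);
            // Print lengths in code points so the result does not depend
            // on the console encoding.
            System.out.println(input.codePointCount(0, input.length())
                + " -> " + sanitized.codePointCount(0, sanitized.length()));
        }
    }
}
```

Comparing code point counts instead of printing the strings avoids confusing a console that cannot display the emoji with an actual sanitizing bug.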