java xml dom escaping invalid-characters

Java DOM transforming and parsing arbitrary strings with invalid XML characters?


First of all, this is not a duplicate of How to parse invalid (bad / not well-formed) XML?: I don't have a given invalid (or not well-formed) XML file, but an arbitrary Java String which may or may not contain an invalid XML character. I want to create a DOM Document containing a Text node with the given String, then transform it to a file. When the file is parsed back into a DOM Document, I want to get a String that is equal to the initial String. I create the Text node with org.w3c.dom.Document#createTextNode(String data) and read the String back with org.w3c.dom.Node#getTextContent().

As you can see in https://stackoverflow.com/a/28152666/3882565, some characters are invalid in the Text nodes of an XML file. Actually there are two different kinds of "invalid" characters for Text nodes. First, there are the characters covered by the predefined entities, namely ", &, ', < and >. These are automatically escaped by the DOM API as &quot;, &amp;, &apos;, &lt; and &gt; in the resulting file, and the escaping is undone by the DOM API when the file is parsed. The problem is that this is not the case for other invalid characters such as '\u0000' or '\uffff': an exception occurs when parsing the file because these characters are not allowed in XML at all.
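To make the two kinds of characters concrete, here is a minimal sketch. The class name `InvalidXmlCharDemo` and its helper methods are made up for illustration and are not part of any library. It tests code points against the XML 1.0 `Char` production and shows that a standard JAXP parser accepts `&lt;` and `&amp;` but rejects a character reference to U+0000.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class InvalidXmlCharDemo {

    /** Returns true if the code point matches the XML 1.0 Char production. */
    public static boolean isValidXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** Parses the XML and returns the root element's text, or null if the parser rejects it. */
    public static String parseText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler keeps the parser from printing to stderr; fatal errors are still thrown.
            builder.setErrorHandler(new DefaultHandler());
            Document document = builder.parse(new InputSource(new StringReader(xml)));
            return document.getDocumentElement().getTextContent();
        } catch (SAXException e) {
            return null; // not well-formed, e.g. a reference to an invalid character
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidXmlChar('<'));      // true: valid, it only needs escaping
        System.out.println(isValidXmlChar('\u0000')); // false: not allowed in any form
        System.out.println(parseText("<element>a&lt;b&amp;c</element>")); // a<b&c
        System.out.println(parseText("<element>a&#0;b</element>"));       // null
    }
}
```

The point of the sketch is that no XML-level escaping helps for the second kind of character, so any escaping has to happen in the String before it reaches the DOM.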

I probably have to implement a method which escapes those characters in the given String in an unambiguous way before passing it to the DOM API, and undo that later when I get the String back, right? Is there a better way to do this? Has someone already implemented these or similar methods?

Edit: This question was marked as a duplicate of Best way to encode text data for XML in Java?. I have now read all of the answers but none of them solves my problem. All of the answers suggest:


Solution

  • As @VGR and @kjhughes pointed out in the comments below the question, Base64 is indeed a possible answer to my question. I now have a further solution for my problem, which is based on escaping. I have written two functions, escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string), which can be used as follows.

        // Needs: java.io.File, javax.xml.parsers.DocumentBuilderFactory,
        // javax.xml.transform.TransformerFactory, javax.xml.transform.dom.DOMSource,
        // javax.xml.transform.stream.StreamResult, org.w3c.dom.Document, org.w3c.dom.Element
        String string = "text#text##text#0;text" + '\u0000' + "text<text&text#";

        // Build a document whose only Text node holds the escaped string.
        Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element element = document.createElement("element");
        element.appendChild(document.createTextNode(escapeInvalidXmlCharacters(string)));
        document.appendChild(element);

        // Serialize the document to test.xml.
        TransformerFactory.newInstance().newTransformer().transform(new DOMSource(document), new StreamResult(new File("test.xml")));
        // creates <?xml version="1.0" encoding="UTF-8" standalone="no"?><element>text##text####text##0;text#0;text&lt;text&amp;text##</element>

        // Parse the file again and undo the escaping.
        document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File("test.xml"));
        System.out.println(unescapeInvalidXmlCharacters(document.getDocumentElement().getTextContent()).equals(string));
        // prints true
    

    escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string):

    /**
     * Escapes invalid XML Unicode code points in a <code>{@link String}</code>. The
     * DOM API already escapes predefined entities, such as {@code "}, {@code &},
     * {@code '}, {@code <} and {@code >} for
     * <code>{@link org.w3c.dom.Text Text}</code> nodes. Therefore, these Unicode
     * code points are ignored by this function. However, there are some other
     * invalid XML Unicode code points, such as {@code '\u0000'}, which are even
     * invalid in their escaped form, such as {@code "&#0;"}.
     * <p>
     * This function replaces all {@code '#'} by {@code "##"} and all Unicode code
     * points that are not in the ranges #x9 | #xA | #xD | [#x20-#xD7FF] |
     * [#xE000-#xFFFD] | [#x10000-#x10FFFF] by the <code>{@link String}</code>
     * {@code "#c;"}, where <code>c</code> is the Unicode code point.
     *
     * @param string the <code>{@link String}</code> to be escaped
     * @return the escaped <code>{@link String}</code>
     * @see #unescapeInvalidXmlCharacters(String)
     */
    public static final String escapeInvalidXmlCharacters(String string) {
        if (string == null) {
            throw new IllegalArgumentException("string cannot be null");
        }
    
        StringBuilder stringBuilder = new StringBuilder();
    
        for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
            codePoint = string.codePointAt(i);
    
            if (codePoint == '#') {
                stringBuilder.append("##");
            } else if (codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                    || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                    || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                    || (codePoint >= 0x10000 && codePoint <= 0x10FFFF)) {
                stringBuilder.appendCodePoint(codePoint);
            } else {
                stringBuilder.append('#').append(codePoint).append(';');
            }
        }
    
        return stringBuilder.toString();
    }
    
    /**
     * Unescapes invalid XML Unicode code points in a <code>{@link String}</code>.
     * Reverses <code>{@link #escapeInvalidXmlCharacters(String)}</code>.
     *
     * @param string the <code>{@link String}</code> to be unescaped
     * @return the unescaped <code>{@link String}</code>
     * @see #escapeInvalidXmlCharacters(String)
     */
    public static final String unescapeInvalidXmlCharacters(String string) {
        if (string == null) {
            throw new IllegalArgumentException("string cannot be null");
        }
    
        StringBuilder stringBuilder = new StringBuilder();
        boolean escaped = false;
    
        for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
            codePoint = string.codePointAt(i);
    
            if (escaped) {
                stringBuilder.appendCodePoint(codePoint);
                escaped = false;
            } else if (codePoint == '#') {
                StringBuilder intBuilder = new StringBuilder();
                int j;
    
                for (j = i + 1; j < string.length(); j += Character.charCount(codePoint)) {
                    codePoint = string.codePointAt(j);
    
                    if (codePoint == ';') {
                        escaped = true;
                        break;
                    }
    
                    if (codePoint >= '0' && codePoint <= '9') {
                        intBuilder.appendCodePoint(codePoint);
                    } else {
                        break;
                    }
                }
    
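                if (escaped) {
                    try {
                        // "#<digits>;" found: decode it and continue after the ';'.
                        // A separate variable keeps codePoint == ';', so the loop
                        // increment always advances by exactly one char.
                        int decoded = Integer.parseInt(intBuilder.toString());
                        stringBuilder.appendCodePoint(decoded);
                        escaped = false;
                        i = j;
                    } catch (IllegalArgumentException e) {
                        // no digits or not a valid code point: treat '#' as an escape prefix
                        codePoint = '#';
                        escaped = true;
                    }
                } else {
                    // "##" (or a stray '#'): the next code point is appended literally
                    codePoint = '#';
                    escaped = true;
                }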
            } else {
                stringBuilder.appendCodePoint(codePoint);
            }
        }
    
        return stringBuilder.toString();
    }
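As a possible alternative to the hand-written scanner in unescapeInvalidXmlCharacters, the unescaping step can be sketched with a regular expression. This is only a sketch: the class name `RegexUnescapeSketch` and the method name `unescape` are hypothetical, and it assumes the input was produced by escapeInvalidXmlCharacters, so every `#` is either doubled or starts a `#<decimal>;` sequence. Unlike the function above, it leaves anything that does not match these two forms unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexUnescapeSketch {

    // Either a doubled '#', or '#' followed by 1 to 7 ASCII digits and ';'.
    // Seven digits are enough for the largest code point (1114111) and cannot overflow an int.
    private static final Pattern ESCAPE = Pattern.compile("##|#(\\d{1,7});");

    /** Hypothetical regex-based counterpart of unescapeInvalidXmlCharacters. */
    public static String unescape(String escaped) {
        Matcher matcher = ESCAPE.matcher(escaped);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String replacement;
            if (matcher.group(1) == null) {
                replacement = "#"; // "##" stands for a literal '#'
            } else {
                int codePoint = Integer.parseInt(matcher.group(1));
                replacement = Character.isValidCodePoint(codePoint)
                        ? new String(Character.toChars(codePoint))
                        : matcher.group(); // out of range: keep the matched text as it is
            }
            // quoteReplacement protects '$' and '\' in the replacement text
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        String original = "text#text##text#0;text" + '\u0000' + "text<text&text#";
        String escaped = "text##text####text##0;text#0;text<text&text##";
        System.out.println(unescape(escaped).equals(original)); // true
    }
}
```

Because the matcher scans from left to right without overlapping matches, `##0;` is read as a literal `#` followed by `0;`, which is the same disambiguation the doubling of `#` provides in the loop-based version.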
    

    Note that escapeInvalidXmlCharacters and unescapeInvalidXmlCharacters as written here are probably inefficient and could be written in a better way. Feel free to post suggestions for improving the code in the comments.
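For comparison, the Base64 approach mentioned by @VGR and @kjhughes could look roughly like the sketch below. The class name `Base64TextCodec` is made up for illustration. It assumes the String is well-formed UTF-16: '\u0000' and '\uffff' survive the round trip, but an unpaired surrogate would be replaced during UTF-8 encoding and would not come back unchanged.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64TextCodec {

    /** Encodes the string's UTF-8 bytes as Base64, which uses only XML-safe ASCII characters. */
    public static String encode(String string) {
        return Base64.getEncoder().encodeToString(string.getBytes(StandardCharsets.UTF_8));
    }

    /** Reverses encode(String). */
    public static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "text" + '\u0000' + "text<text&text#" + '\uffff';
        String encoded = encode(original);
        // The encoded form can be passed to createTextNode without further escaping.
        System.out.println(encoded.matches("[A-Za-z0-9+/=]*")); // true
        System.out.println(decode(encoded).equals(original));   // true
    }
}
```

The trade-off compared with the escaping functions is that the text in the file is no longer human-readable and grows by about a third.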