java xml stringescapeutils

Correct XML escaping in Java


I need to convert CSV into XML and then write it to an OutputStream. The rule is to convert " into &quot; in my code.

Input CSV row:

{"Test":"Value"}

Expected output:

<root>
<child>{&quot;Test&quot;:&quot;Value&quot;}</child>
</root>

Current output:

<root>
<child>{&amp;quot;Test&amp;quot;:&amp;quot;Value&amp;quot;}</child>
</root>

Code:

File file = new File(FilePath);
BufferedReader reader = null;

DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder domBuilder = domFactory.newDocumentBuilder();

Document newDoc = domBuilder.newDocument();
Element rootElement = newDoc.createElement("root");
newDoc.appendChild(rootElement);

reader = new BufferedReader(new FileReader(file));
String text = null;

while ((text = reader.readLine()) != null) {
    Element rowElement = newDoc.createElement("child");
    rootElement.appendChild(rowElement);
    text = StringEscapeUtils.escapeXml(text);
    rowElement.setTextContent(text);
}

ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
Source xmlSource = new DOMSource(newDoc);
Result outputTarget = new StreamResult(outputStream);
TransformerFactory.newInstance().newTransformer().transform(xmlSource, outputTarget);
System.out.println(new String(outputStream.toByteArray()));

Could you please help? What am I missing, and at what point is & converted to &amp;?


Solution

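  • The XML library automatically escapes any characters that need to be XML-escaped, so you don't need to escape manually with StringEscapeUtils.escapeXml. That call turns each " into &quot;, and the serializer then escapes the & in each &quot; as &amp;, which is where &amp;quot; comes from. Simply remove that line and you should get what you're looking for: properly escaped XML.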

    XML doesn't require " characters to be escaped everywhere, only within attribute values that are themselves delimited by double quotes. So this is valid XML already:

    <root>
    <child>{"Test":"Value"}</child>
    </root>
    

    You would escape the quotes if you had an attribute that contained a quote, such as: <child attr="properly &quot;ed"/>

    This is one of the main reasons to use an XML library: the subtleties of quoting are already handled for you. No need to read the XML spec to make sure you got the quoting rules correct.
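
    As a rough illustration of the point above, here is a minimal sketch that uses only the JDK's built-in DOM and Transformer classes. The class name CsvToXml and the helper toXml are made up for this example, and the rows are passed in as a list rather than read from a file. It sets the raw text on each element without any manual escaping, and also shows what happens when the text has already been escaped once.

    ```java
    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;

    // Hypothetical class name, used only for this sketch.
    class CsvToXml {

        // Builds <root><child>...</child></root> from raw rows, with no manual escaping.
        static String toXml(List<String> rows) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            doc.appendChild(root);

            for (String row : rows) {
                Element child = doc.createElement("child");
                child.setTextContent(row); // raw text; the serializer escapes what it must
                root.appendChild(child);
            }

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Leave out the <?xml ...?> header so the output is easier to compare.
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }

        public static void main(String[] args) throws Exception {
            // Raw input: quotes in element text are written as-is.
            System.out.println(toXml(List.of("{\"Test\":\"Value\"}")));

            // Input that was already escaped once: each & is escaped again as &amp;.
            System.out.println(toXml(List.of("{&quot;Test&quot;:&quot;Value&quot;}")));
        }
    }
    ```

    With the JDK's default serializer, the first call should leave the quotes untouched inside <child>, while the second reproduces the &amp;quot; output from the question. A parser reading the first document back returns the original text, quotes included.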