java, xml, utf-8

How to prevent javax.xml.transform.Transformer from replacing non-BMP Unicode characters with numeric character references in UTF-8 encoding?


Background

I want to write an XML file containing non-BMP characters with UTF-8 encoding.

Problem

The following code generates an XML file in which non-BMP Unicode characters are replaced with numeric character references.

package xml;

import java.io.File;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XMLClass {

    static String names[] = {"𠀋一郎", "𠮷野","辻󠄀","👨‍👩‍👦"};

    public static void main(String[] args) {

        DocumentBuilder documentBuilder;
        try {
            documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
            return; // without a builder there is no document to write
        }
        Document document = documentBuilder.newDocument();
        document.setXmlStandalone(true);

        Element list = document.createElement("list");
        document.appendChild(list);
        for (int i = 0; i < names.length; i++) {
            Element name = document.createElement("name");
            list.appendChild(name);
            name.appendChild(document.createTextNode(names[i]));
        }

        File file = new File("NameList.xml");
        write(file, document);
    }


    public static boolean write(File file, Document document) {

        Transformer transformer;
        try {
            TransformerFactory transformerFactory = TransformerFactory.newInstance();
            transformer = transformerFactory.newTransformer();
        } catch (TransformerConfigurationException e) {
            e.printStackTrace();
            return false;
        }

        transformer.setOutputProperty("indent", "yes");
        // With "UTF-16" as the encoding, non-BMP characters are written literally
        // (not as numeric character references); with "UTF-8" they are not.
        transformer.setOutputProperty("encoding", "UTF-8");
        transformer.setOutputProperty("{http://xml.apache.org/xalan}indent-amount", "2");

        try {
            transformer.transform(new DOMSource(document), new StreamResult(file));
        } catch (TransformerException e) {
            e.printStackTrace();
            return false;
        }

        return true;
    }
}

What I expected is:

<?xml version="1.0" encoding="UTF-8"?><list>
<name>𠀋一郎</name>
<name>𠮷野</name>
<name>辻󠄀</name>
<name>👨‍👩‍👦</name>
</list>

But what I got is:

<?xml version="1.0" encoding="UTF-8"?><list>
<name>&#131083;一郎</name>
<name>&#134071;野</name>
<name>辻&#917760;</name>
<name>&#128104;‍&#128105;‍&#128102;</name>
</list>
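
The numbers in these references are the decimal code points of the supplementary characters: 131083 is U+2000B, 134071 is U+20BB7, 917760 is U+E0100 and 128104 is U+1F468. In a Java String each of them occupies a surrogate pair. The snippet below is only an illustrative sketch of that mapping, not the serializer's actual code; the class and method names (CodePointDemo, escapeSupplementary) are made up for this example.

```java
public class CodePointDemo {

    // Hypothetical helper: mimics the observed output by turning every
    // supplementary (non-BMP) code point into a decimal character reference
    // and leaving BMP characters untouched.
    public static String escapeSupplementary(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (Character.isSupplementaryCodePoint(cp)) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // U+2000B is built from its code point so that the encoding of this
        // source file does not matter.
        String kanji = new String(Character.toChars(0x2000B));

        System.out.println(kanji.length());                          // 2 UTF-16 code units
        System.out.println(kanji.codePointCount(0, kanji.length())); // 1 code point
        System.out.println(escapeSupplementary(kanji + "\u4E00\u90CE"));
    }
}
```

The last line prints the same text as the first name element above, which suggests the serializer is treating supplementary code points as not directly writable in UTF-8, even though UTF-8 can encode them.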

Question

How can I prevent javax.xml.transform.Transformer from replacing non-BMP Unicode characters with numeric character references when the output encoding is set to UTF-8?


Solution

  • Based on @Michael Kay's comment, switching the XSLT processor from Xalan to Saxon solves this problem. Add saxon-he-12.5.jar and xmlresolver-5.2.2.jar from Saxon-HE 12.5 to the classpath, and set the system property on the command line with -Djavax.xml.transform.TransformerFactory=net.sf.saxon.TransformerFactoryImpl. This solution has the advantage of requiring no changes to the source code.
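
If changing the source is acceptable, the same factory can presumably be requested in code through the standard JAXP overload TransformerFactory.newInstance(String, ClassLoader) instead of the -D setting. The following is a minimal sketch under that assumption; FactorySelector and newFactory are hypothetical names, and the fallback to the default factory is my own addition so that the code still runs when the Saxon jars are missing.

```java
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.TransformerFactoryConfigurationError;

public class FactorySelector {

    // Factory class name taken from the -D setting above.
    static final String SAXON_FACTORY = "net.sf.saxon.TransformerFactoryImpl";

    // Returns the Saxon factory when it can be loaded, otherwise the
    // platform default (assumption: a silent fallback is acceptable here).
    public static TransformerFactory newFactory() {
        try {
            // A null class loader means the context class loader is used.
            return TransformerFactory.newInstance(SAXON_FACTORY, null);
        } catch (TransformerFactoryConfigurationError e) {
            // Saxon is not on the classpath or could not be instantiated.
            return TransformerFactory.newInstance();
        }
    }

    public static void main(String[] args) {
        // Shows which implementation was actually picked up.
        System.out.println(newFactory().getClass().getName());
    }
}
```

In write(), the call to TransformerFactory.newInstance() would then be replaced by FactorySelector.newFactory(). Printing the class name is a quick way to confirm that Saxon, and not the built-in processor, is the one in use.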