I want to write an XML file containing non-BMP characters with UTF-8 encoding.
The following code generates an XML file in which non-BMP Unicode characters are replaced with numeric character references.
package xml;

import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XMLClass {
    static String names[] = {"𠀋一郎", "𠮷野","辻󠄀","👨👩👦"};

    public static void main(String[] args) {
        DocumentBuilder documentBuilder = null;
        try {
            documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        }
        Document document = documentBuilder.newDocument();
        document.setXmlStandalone(true);
        Element list = document.createElement("list");
        document.appendChild(list);
        for (int i = 0; i < names.length; i++) {
            Element name = (Element) document.createElement("name").cloneNode(false);
            list.appendChild(name);
            name.appendChild(document.createTextNode(names[i]));
        }
        File file = new File("NameList.xml");
        write(file, document);
    }

    public static boolean write(File file, Document document) {
        Transformer transformer = null;
        try {
            TransformerFactory transformerFactory = TransformerFactory.newInstance();
            transformer = transformerFactory.newTransformer();
        } catch (TransformerConfigurationException e) {
            e.printStackTrace();
            return false;
        }
        transformer.setOutputProperty("indent", "yes");
        // With the encoding set to UTF-16, non-BMP characters are written
        // literally (not as numeric character references).
        transformer.setOutputProperty("encoding", "UTF-8");
        transformer.setOutputProperty("{http://xml.apache.org/xalan}indent-amount", "2");
        try {
            transformer.transform(new DOMSource(document), new StreamResult(file));
        } catch (TransformerException e) {
            e.printStackTrace();
            return false;
        }
        return true;
    }
}
What I expected is:
<?xml version="1.0" encoding="UTF-8"?><list>
<name>𠀋一郎</name>
<name>𠮷野</name>
<name>辻󠄀</name>
<name>👨👩👦</name>
</list>
But what I got is:
<?xml version="1.0" encoding="UTF-8"?><list>
<name>&#131083;一郎</name>
<name>&#134071;野</name>
<name>辻&#917760;</name>
<name>&#128104;&#128105;&#128102;</name>
</list>
How can I prevent javax.xml.transform.Transformer from replacing non-BMP Unicode characters with numeric character references when I specify UTF-8 encoding?
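For reference, the number in a decimal numeric character reference is simply the Unicode code point of the character, and non-BMP characters are those whose code point is above U+FFFF (stored in a Java String as a surrogate pair). The sketch below only illustrates that relationship; escapeNonBmp and CodePointDemo are hypothetical names made up for this example and are not part of Xalan or any other serializer.

```java
final class CodePointDemo {
    // Hypothetical helper: renders each supplementary (non-BMP) code point as a
    // decimal numeric character reference and copies BMP characters unchanged.
    static String escapeNonBmp(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (Character.isSupplementaryCodePoint(cp)) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // U+2000B (written as its surrogate pair) followed by U+4E00 and U+90CE
        String name = "\uD840\uDC0B\u4E00\u90CE";
        System.out.println(name.length());                          // UTF-16 code units: 4
        System.out.println(name.codePointCount(0, name.length()));  // code points: 3
        System.out.println(escapeNonBmp(name));                     // &#131083; then the two BMP characters
    }
}
```

Both forms are equivalent to an XML parser, so the question is only about how the file looks on disk.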
Based on @Michael Kay's comment, switching the XSLT processor from Xalan to Saxon solves this problem. You have to add saxon-he-12.5.jar and xmlresolver-5.2.2.jar from Saxon-HE 12.5 to the classpath, and set the system property with -Djavax.xml.transform.TransformerFactory=net.sf.saxon.TransformerFactoryImpl. This solution has the advantage of requiring no changes to the source code.
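To confirm that the system property actually took effect, it may help to print the class of the factory that JAXP selects. This is a minimal sketch; FactoryCheck and activeFactoryName are hypothetical names used only for this example. With the Saxon jars on the classpath and the property set as above, I would expect it to print net.sf.saxon.TransformerFactoryImpl; without them it prints whatever implementation the JDK bundles.

```java
import javax.xml.transform.TransformerFactory;

final class FactoryCheck {
    // Returns the fully qualified class name of the TransformerFactory
    // implementation that TransformerFactory.newInstance() resolves to.
    static String activeFactoryName() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println(activeFactoryName());
    }
}
```

Run it with the same classpath and -D option as the real program, since the lookup depends on both.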