java xml utf-8 character-encoding dom4j

DOM4J writes umlauts (Ä, ü, ß) incorrectly with UTF-8 encoding


I'm using DOM4J to parse and write an XML tree that is always encoded in UTF-8.

My XML file contains German special characters. Parsing them is not a problem, but when I write the tree to a file, the special characters are converted to � characters.

I can't change the encoding of the XML file as it is restricted to UTF-8.

Code

SAXReader xmlReader = new SAXReader();
xmlReader.setEncoding("UTF-8");

Document doc = xmlReader.read(file);
doc.setXMLEncoding("UTF-8");
Element root = doc.getRootElement();

// manipulate doc

OutputFormat format = new OutputFormat();

format.setEncoding("UTF-8");

XMLWriter writer = new XMLWriter(new FileWriter(file), format);

writer.write(doc);
writer.close();

Expected output

... 
<statementText>This is a test!Ä Ü ß</statementText>
...

Actual output

...
<statementText>This is a test!� � �</statementText>
...

Solution

  • You are passing a FileWriter to the XMLWriter. A Writer accepts String or char[] data and converts it to bytes itself, so the encoding is decided by the Writer, and the XMLWriter has no way of influencing it.

    Additionally, FileWriter is an especially problematic Writer type: before Java 11 you could not specify which encoding it should use, so it always used the platform default encoding (which is often something like windows-1252 on Windows and UTF-8 on Linux). Unless you pass a charset explicitly, it should basically never be used for this reason.

    To let the XMLWriter apply the encoding it was configured with, pass it an OutputStream instead (which handles byte[]). The most obvious one to use here is FileOutputStream:

    XMLWriter writer = new XMLWriter(new FileOutputStream(file), format);
    

    This is even documented in the JavaDoc for XMLWriter:

    Warning: using your own Writer may cause the writer's preferred character encoding to be ignored. If you use encodings other than UTF8, we recommend using the method that takes an OutputStream instead.

    Arguably the warning is a bit misleading, as the Writer can be problematic even if you intend to write UTF-8 data.
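
The effect described above can be reproduced without DOM4J. The following is only a minimal sketch using the standard library: the class name WriterEncodingDemo and its helper methods are made up for illustration, and an OutputStreamWriter with an explicit ISO-8859-1 charset stands in for a FileWriter running on a platform whose default encoding is not UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of DOM4J.
class WriterEncodingDemo {

    // Encodes the text through a Writer bound to a fixed charset, the way
    // FileWriter is bound to the platform default charset.
    static byte[] writeWith(String text, Charset writerCharset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, writerCharset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Decodes the bytes as UTF-8, as a consumer trusting the
    // encoding="UTF-8" declaration in the XML file would.
    static String readAsUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Unicode escapes for "Ä Ü ß", so the source file's own encoding does not matter.
        String text = "\u00C4 \u00DC \u00DF";

        // The Writer encodes as UTF-8: the text survives the round trip.
        System.out.println(readAsUtf8(writeWith(text, StandardCharsets.UTF_8)));

        // The Writer encodes as ISO-8859-1: the bytes are not valid UTF-8,
        // so decoding produces replacement characters (U+FFFD).
        System.out.println(readAsUtf8(writeWith(text, StandardCharsets.ISO_8859_1)));
    }
}
```

The second call corresponds to the situation in the question: the XML declaration says UTF-8, but the Writer has already turned the characters into bytes in a different encoding. Handing the XMLWriter an OutputStream avoids this, because the configured encoding is then the only one involved.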