I have a problem with an XML document that contains special characters (the problematic string is löööschee`*‘‘§a). The XML arrives as a XOM object in Java. While investigating, I tried to print the text of the XML with a serializer, and I noticed that streaming directly to System.out was the only way to get the correct string.
Here is the code I used to print the XML:
    Element pEntry; // this is the XOM object I get, it contains the XML
    Document document = pEntry.getDocument();
    ByteArrayOutputStream stream = new ByteArrayOutputStream();
    Serializer serializer = new Serializer(stream);      // writes into the byte buffer
    Serializer serializer2 = new Serializer(System.out); // writes straight to the console
    try {
        serializer.write(document);
        serializer2.write(document);
    } catch (IOException e) {
        System.out.println(e.getMessage());
    }
    System.out.println("#####################################################################");
    System.out.println(stream);
So serializer2 writes directly to System.out, and there the string appears as it should.
The System.out.println(stream) call prints the string as l??????schee`*????????a. I tried many different encodings (the serializer's default encoding is "UTF-8", which seems correct), but the only way I found to print the correct string is to stream directly to System.out.
I also printed the bytes of the first stream (the one that does not work), and this was the output:
6c ffffffc3 ffffffb6 ffffffc3 ffffffb6 ffffffc3 ffffffb6 73 63 68 65 65 60 2a ffffffe2 ffffff80 ffffff98 ffffffe2 ffffff80 ffffff98 ffffffc2 ffffffa7 61.
I don't really know whether this is correct, and I can't print the bytes that are streamed directly to System.out. I saw that c3 b6, for example, should be an ö, which would be correct, but I don't know about the ffffff prefixes.
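For reference, the ffffff prefixes look like sign extension rather than extra data: Java bytes are signed, so a byte such as 0xc3 is widened to a negative int before it is formatted as hex. Here is a minimal sketch of a hex dump that masks each byte first. The class and method names (HexDump, toHex) are made up for illustration and are not part of my original code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not from the original code.
class HexDump {
    // Formats each byte as two hex digits; the & 0xff mask removes the sign extension.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "l\u00f6" is "lö"; the escape avoids depending on the source file encoding.
        byte[] utf8 = "l\u00f6".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(utf8));                   // masked: 6c c3 b6
        System.out.println(Integer.toHexString(utf8[1]));  // unmasked: ffffffc3
    }
}
```

Read that way, the dump above is 6c c3 b6 c3 b6 c3 b6 73 63 68 65 65 60 2a e2 80 98 e2 80 98 c2 a7 61, which is the expected UTF-8 encoding of the string.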
Why do the two outputs differ, even though both serializers use the same encoding?
Other things I tried:
Instead of System.out.println(stream), putting
String xmlContent = stream.toString(StandardCharsets.UTF_8);
System.out.println(xmlContent);
-> I think this was at least an improvement; the string then looked like l???schee`*?????a
Putting the line
System.setOut(new PrintStream(System.out, true, StandardCharsets.UTF_8));
above the console output solved the problem; the console now always shows the correct string.
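A self-contained sketch of the combination that worked for me, without XOM: the bytes are decoded explicitly as UTF-8, and System.out is re-wrapped so that println also encodes as UTF-8. The class name Utf8ConsoleSketch and the helper decodeUtf8 are hypothetical, the byte buffer is only a stand-in for what the Serializer wrote, and it assumes Java 10 or later (for the Charset overloads) and a console that actually renders UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

// Hypothetical names, for illustration only.
class Utf8ConsoleSketch {
    // Decodes the buffered bytes as UTF-8 instead of relying on the platform default charset.
    static String decodeUtf8(ByteArrayOutputStream stream) {
        return stream.toString(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for the bytes the serializer wrote; escapes spell out the problematic string.
        byte[] utf8 = "l\u00f6\u00f6\u00f6schee`*\u2018\u2018\u00a7a".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        stream.write(utf8, 0, utf8.length);

        // Re-wrap System.out so that println encodes characters as UTF-8.
        System.setOut(new PrintStream(System.out, true, StandardCharsets.UTF_8));
        System.out.println(decodeUtf8(stream));
    }
}
```

Both steps seem to matter: the explicit charset in toString governs how the bytes become a String, and the PrintStream charset governs how that String becomes bytes again on the way to the console.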