I am trying to parse an XML document that contains the hexadecimal character reference &#x1D4C5;, which represents the mathematical symbol 𝓅 (U+1D4C5). The output that I am getting is �� instead.
What am I doing wrong?
Example input XML:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<data>&#x1D4C5;</data>
</root>
Output:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<data>��</data>
</root>
Code to obtain the XML reader:
factory = org.apache.xerces.jaxp.SAXParserFactoryImpl.newInstance();
final XMLReader xmlReader;
xmlReader = factory.newSAXParser().getXMLReader();
I am using UTF-8 encoding to decode while parsing.
The code I am using to read and write the XML is this method:
public void readAndWriteXml(InputSource inputSource, OutputStream out) throws IOException, SAXException, ParserConfigurationException {
    XMLReader xmlReader = getXmlReader();
    // The serializer receives the SAX events and writes them to the output stream.
    Serializer serializer = SerializerFactory.getSerializer(configProps);
    serializer.setOutputStream(out);
    xmlReader.setContentHandler(serializer.asContentHandler());
    if (logger != null) {
        getLogger().debug("starting xml parsing " + LocalTime.now());
    }
    xmlReader.parse(inputSource);
    if (logger != null) {
        getLogger().debug("end xml parsing " + LocalTime.now());
    }
}
getXmlReader() is this:
final XMLReader xmlReader;
xmlReader = factory.newSAXParser().getXMLReader();
xmlReader.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
xmlReader.setFeature("http://xml.org/sax/features/namespaces", true);
xmlReader.setFeature("http://xml.org/sax/features/external-parameter-entities", true);
// xmlReader.setFeature("http://xml.org/sax/features/validation", true);
xmlReader.setEntityResolver(wrappedEntityResolver);
xmlReader.setErrorHandler(new SaxErrorHandler());
return xmlReader;
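Here I am initialising the class:
public XmlNormalizer(String catalogPath) throws IOException {
    // We want the Apache XML parser, not the embedded Oracle Java version.
    // Note: newInstance() is a static method inherited from javax.xml.parsers.SAXParserFactory,
    // so this call goes through the standard JAXP lookup (which typically picks Xerces
    // when xercesImpl.jar is on the classpath) rather than forcing this class.
    factory = org.apache.xerces.jaxp.SAXParserFactoryImpl.newInstance();
    factory.setNamespaceAware(true);
    // Collect every catalog file under catalogPath for the entity resolver.
    List<Path> catalogFiles = this.findByFileName(new File(catalogPath).toPath(), CATALOG_FILENAME_PATTERN);
    String[] catalogArray = catalogFiles.stream().map(Path::toString).toArray(String[]::new);
    // Default output properties for the "xml" method; these are passed to the serializer.
    configProps = OutputPropertiesFactory.getDefaultMethodProperties("xml");
    XMLCatalogResolver xmlCatalogResolver = new XMLCatalogResolver(catalogArray, true);
    wrappedEntityResolver = new WrappedEntityResolver(xmlCatalogResolver);
}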
WrappedEntityResolver is just a wrapper around org.apache.xerces.util.XMLCatalogResolver.
That output is most definitely wrong, but it's hard to tell why.
What are the properties passed to the serializer?
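One way to narrow this down is to check what the parser itself reports before any serializer is involved. The sketch below is only a diagnostic under stated assumptions: it uses the JDK's default SAX parser rather than the Xerces setup from the question, and the class and method names (ParserCheck, collectText) are made up for illustration. Since 𝓅 (U+1D4C5) lies outside the Basic Multilingual Plane, Java holds it as a surrogate pair of two chars, 55349 (0xD835) and 56517 (0xDCC5). If the parser delivers that pair intact, the damage is more likely happening on the serializer side.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ParserCheck {

    /** Parses the XML and returns all character data the parser reported. */
    static String collectText(String xml) {
        try {
            // Assumption: the JDK's default parser, not the Xerces build from the question.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();

            StringBuilder text = new StringBuilder();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    // May be called several times; accumulate every chunk.
                    text.append(ch, start, length);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
            return text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String text = collectText("<root><data>&#x1D4C5;</data></root>");
        // Number of UTF-16 code units the parser handed over.
        System.out.println("UTF-16 code units: " + text.length());
        // The code points those units decode to.
        text.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        // The individual chars, in decimal, as a serializer would see them.
        for (char c : text.toCharArray()) {
            System.out.println((int) c);
        }
    }
}
```

If the same check against your own XMLReader also yields a single code point U+1D4C5, then the serializer configuration (encoding and related output properties) is the next thing to look at.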
If you serialize with Saxon, then with default encoding (UTF-8) the output is
<?xml version="1.0" encoding="UTF-8"?><root>
<data>𝓅</data>
</root>
while with encoding=us-ascii the output is:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<data>&#x1d4c5;</data>
</root>
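The difference between the two encodings comes down to whether the target charset can represent the character. UTF-8 can, so the character may be written literally. us-ascii cannot, so a serializer has to fall back to a character reference for the whole code point. The following is a minimal sketch of that decision using only the JDK. It is not Saxon's or Xalan's actual implementation, and the names CharRefSketch and render are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharRefSketch {

    /**
     * Returns the character itself if the charset can encode it,
     * otherwise a hexadecimal character reference for the full code point.
     */
    static String render(int codePoint, Charset charset) {
        // Character.toChars yields one char, or a surrogate pair above U+FFFF.
        String s = new String(Character.toChars(codePoint));
        if (charset.newEncoder().canEncode(s)) {
            return s;
        }
        // Reference the code point as a whole, never the two surrogates separately.
        return "&#x" + Integer.toHexString(codePoint) + ";";
    }

    public static void main(String[] args) {
        int scriptSmallP = 0x1D4C5;
        System.out.println(render(scriptSmallP, StandardCharsets.UTF_8));
        System.out.println(render(scriptSmallP, StandardCharsets.US_ASCII));
    }
}
```

The important detail is that the reference is built from the code point rather than from each UTF-16 char. Writing the two surrogate halves as separate references would not be well-formed XML.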