java xml sax saxparser xerces

XMLReader turns a character outside the BMP into surrogate-pair character references, which results in invalid XML


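I am trying to parse an XML document that contains the hexadecimal character reference &#x1d4c5;, which represents the mathematical symbol 𝓅 (U+1D4C5). In the output, this single character is written as two separate character references, &#55349;&#56517;, one for each UTF-16 surrogate.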

What am I doing wrong?

Example input XML:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <data>&#x1d4c5;</data>
</root>

Output:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <data>&#55349;&#56517;</data>
</root>

Code to obtain the XML reader:

factory = org.apache.xerces.jaxp.SAXParserFactoryImpl.newInstance();
final XMLReader xmlReader = factory.newSAXParser().getXMLReader();

The input is decoded as UTF-8 while parsing.

The code I use to read and write the XML is this method:

public void readAndWriteXml(InputSource inputSource, OutputStream out) throws IOException, SAXException, ParserConfigurationException {
    XMLReader xmlReader = getXmlReader();
    Serializer serializer = SerializerFactory.getSerializer(configProps);
    serializer.setOutputStream(out);
    // The serializer receives the SAX events directly from the parser.
    xmlReader.setContentHandler(serializer.asContentHandler());

    if (logger != null) {
        getLogger().debug("starting xml parsing" + LocalTime.now());
    }
    xmlReader.parse(inputSource);
    if (logger != null) {
        getLogger().debug("end xml parsing" + LocalTime.now());
    }
}

The body of getXmlReader() is this:

final XMLReader xmlReader = factory.newSAXParser().getXMLReader();
xmlReader.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
xmlReader.setFeature("http://xml.org/sax/features/namespaces", true);
xmlReader.setFeature("http://xml.org/sax/features/external-parameter-entities", true);
// xmlReader.setFeature("http://xml.org/sax/features/validation", true);
xmlReader.setEntityResolver(wrappedEntityResolver);
xmlReader.setErrorHandler(new SaxErrorHandler());
return xmlReader;

This is how the class is initialised:

public XmlNormalizer(String catalogPath) throws IOException {
    // We want the Apache XML parser, not the embedded Oracle Java version.
    factory = org.apache.xerces.jaxp.SAXParserFactoryImpl.newInstance();
    factory.setNamespaceAware(true);
    List<Path> catalogFiles = this.findByFileName(new File(catalogPath).toPath(), CATALOG_FILENAME_PATTERN);
    String[] catalogArray = catalogFiles.stream().map(Path::toString).toArray(String[]::new);
    configProps = OutputPropertiesFactory.getDefaultMethodProperties("xml");
    XMLCatalogResolver xmlCatalogResolver = new XMLCatalogResolver(catalogArray, true);
    wrappedEntityResolver = new WrappedEntityResolver(xmlCatalogResolver);
}

WrappedEntityResolver is just a wrapper around org.apache.xerces.util.XMLCatalogResolver.


Solution

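  • That output is most definitely wrong: &#55349; and &#56517; refer to UTF-16 surrogate code points (U+D835 and U+DCC5), which are not legal XML characters, so the document is no longer well-formed. It's hard to tell why this happens, though.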

    What are the properties passed to the serializer?

    If you serialize with Saxon, then with the default encoding (UTF-8) the output is:

    <?xml version="1.0" encoding="UTF-8"?><root>
       <data>𝓅</data>
    </root>
    

    while with encoding=us-ascii the output is:

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
        <data>&#x1d4c5;</data>
    </root>
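
To make the difference between the broken and the expected output concrete, here is a minimal Java sketch. It is not the code of Saxon, Xerces, or any other serializer; the class and method names (SurrogateSafeEscaper, isXmlChar, escapeText) are made up for illustration. It shows where the numbers 55349 and 56517 come from, why they are not valid as character references, and how a serializer can choose between a literal character and a single code-point reference depending on the output encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class SurrogateSafeEscaper {

    /** XML 1.0 "Char" production; surrogate code points U+D800..U+DFFF are excluded. */
    public static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Escapes element text for the given output charset (hypothetical helper).
     * It iterates over code points, not UTF-16 chars, so a surrogate pair is
     * always treated as one character.
     */
    public static String escapeText(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '&') {
                out.append("&amp;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (encoder.canEncode(ch)) {
                out.append(ch); // representable in the target encoding: write literally
            } else {
                // not representable: one reference for the whole code point
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1D4C5 is stored in a Java String as two UTF-16 code units.
        char[] units = Character.toChars(0x1D4C5);
        System.out.println((int) units[0] + " " + (int) units[1]); // prints "55349 56517"

        // Each half on its own is not a legal XML character; the full code point is.
        System.out.println(isXmlChar(units[0]));  // prints "false"
        System.out.println(isXmlChar(0x1D4C5));   // prints "true"

        String text = "\uD835\uDCC5";
        // With US-ASCII the character is written as a single hex reference.
        System.out.println(escapeText(text, StandardCharsets.US_ASCII)); // prints "&#x1d4c5;"
        // With UTF-8 the character is written literally.
        System.out.println(escapeText(text, StandardCharsets.UTF_8));
    }
}
```

The point of the sketch is the iteration over code points: a serializer that walks the text one char at a time and escapes each char separately produces exactly the pair of references shown in the question.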