java xml parsing utf-8 xerces

Exception when parsing xml file (Invalid byte 2 of 3-byte UTF-8 sequence)


I'm trying to parse an XML file from an external source, and the file contains invalid UTF-8 bytes.


I'm using the following Java code:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);
factory.setIgnoringComments(true);
factory.setNamespaceAware(false);
DocumentBuilder documentBuilder = factory.newDocumentBuilder();
try (InputStream in = getMyInputStream()) {
    Document doc = documentBuilder.parse(new InputSource(in));
    ...
}

And I'm getting the following exception:

Caused by: org.xml.sax.SAXParseException: Invalid byte 2 of 3-byte UTF-8 sequence.
    at java.xml/com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:262)
    at java.xml/com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
    ... 10 common frames omitted
Caused by: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence.
    at java.xml/com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:702)
    at java.xml/com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:409)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1904)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.peekChar(XMLEntityScanner.java:508)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2649)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:534)
    at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
    at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
    at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at java.xml/com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:246)

I realise that the XML contains an invalid UTF-8 byte sequence, but I'd like the XML parser to handle this gracefully rather than throw an exception.


Solution

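  • I solved this by passing a java.io.Reader to the DocumentBuilder instead of a java.io.InputStream. The DocumentBuilder now works on a stream of characters rather than a stream of bytes, so it no longer decodes (and validates) the bytes itself and does not throw this exception. The byte-to-character conversion is done by the InputStreamReader instead, which, when constructed with a Charset, substitutes the replacement character U+FFFD for malformed byte sequences rather than failing. Note that when an InputSource carries a character stream, the parser disregards any encoding declared in the XML declaration, so this only suits input that is meant to be UTF-8.

    So I changed: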

    try (InputStream in = getMyInputStream()) {
       Document doc = documentBuilder.parse(new InputSource(in));
       ...
    }
    

    to:

    try (Reader reader = new InputStreamReader(getMyInputStream(), StandardCharsets.UTF_8)) {
       Document doc = documentBuilder.parse(new InputSource(reader));
       ...
    }
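The change above can be tried in isolation with a small self-contained sketch. The class name LenientXmlParse, the helper parseLeniently, and the sample bytes are made up for illustration and are not part of the original code; the sample deliberately contains a 3-byte UTF-8 lead byte (0xE2) followed by a byte that is not a valid continuation. It assumes the stock JDK parser and that a U+FFFD replacement character in the resulting text is acceptable for your use case.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not from the original post.
class LenientXmlParse {

    // Decodes the bytes as UTF-8 before the parser sees them, so malformed
    // sequences are replaced by the decoder instead of rejected by the parser.
    static Document parseLeniently(InputStream in) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        factory.setIgnoringComments(true);
        factory.setNamespaceAware(false);
        DocumentBuilder documentBuilder = factory.newDocumentBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            return documentBuilder.parse(new InputSource(reader));
        }
    }

    public static void main(String[] args) throws Exception {
        // "<r>a", then an invalid sequence (0xE2 0x28 0xA1), then "b</r>".
        byte[] xml = {
            '<', 'r', '>', 'a',
            (byte) 0xE2, (byte) 0x28, (byte) 0xA1,
            'b', '<', '/', 'r', '>'
        };
        Document doc = parseLeniently(new ByteArrayInputStream(xml));
        String text = doc.getDocumentElement().getTextContent();
        // Reports whether the decoder inserted a replacement character.
        System.out.println(text.contains("\uFFFD"));
    }
}
```

If silently replacing bad bytes is not acceptable, an alternative would be to build the InputStreamReader from a CharsetDecoder configured with CodingErrorAction.REPORT, which makes the decoding step fail with an exception instead of substituting characters.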