I'm trying to parse an xml file from an external source which contains invalid UTF-8 bytes
Using the following java code
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);
factory.setIgnoringComments(true);
factory.setNamespaceAware(false);
DocumentBuilder documentBuilder = factory.newDocumentBuilder();
try (InputStream in = getMyInputStream()) {
Document doc = documentBuilder.parse(new InputSource(in));
...
}
And I'm getting the following exception
Caused by: org.xml.sax.SAXParseException: Invalid byte 2 of 3-byte UTF-8 sequence.
at java.xml/com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:262)
at java.xml/com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
... 10 common frames omitted
Caused by: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence.
at java.xml/com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:702)
at java.xml/com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:409)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1904)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.peekChar(XMLEntityScanner.java:508)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2649)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:534)
at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at java.xml/com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:246)
I realise that the XML contains an invalid UTF-8 character but I'd like the XML parser to gracefully handle this rather than throwing an exception
I solved this by passing a java.io.Reader
to the DocumentBuilder
instead of a java.io.InputStream
. So now the DocumentBuilder
is acting upon a stream of characters instead of a stream of bytes and does not attempt to validate the bytes and hence does not throw exceptions. The byte to character transformation is now done by the InputStreamReader
So I changed
try (InputStream in = getMyInputStream()) {
Document doc = documentBuilder.parse(new InputSource(in));
...
}
To
try (Reader reader = new InputStreamReader(getMyInputStream(), StandardCharsets.UTF_8)) {
Document doc = documentBuilder.parse(new InputSource(reader));
...
}