java xsd

Even with ErrorHandler, why is schema validation ending after the first error?


I’m working on schema validation. The goal is to take an XSD file and validate an incoming document against it. If there are errors, I want to capture all of them.

My ErrorHandler receives only the first error, and then processing ends. There are plenty of examples online of people asking this same question, and the answer always seems to be what I’m already doing (create a custom error handler).

Further, the documentation for the ErrorHandler interface has this to say about how the error method is supposed to work:

/**
 * <p>The SAX parser must continue to provide normal parsing
 * events after invoking this method: it should still be possible
 * for the application to process the document through to the end.
 * If the application cannot do so, then the parser should report
 * a fatal error even if the XML recommendation does not require
 * it to do so.</p>
 */

Note that this example uses Java 13 text blocks (a preview feature in that release, standard since Java 15), but only to keep the XML definitions concise. Nothing else in it depends on that version.

// Imports used by the snippets below (JUnit 5 for the test).
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

private String drugValidationSchema = """
                    <?xml version="1.0" encoding="UTF-8"?>
                    <schema xmlns="http://www.w3.org/2001/XMLSchema"
                    targetNamespace="https://www.company.com/Drug"
                    xmlns:apins="https://www.company.com/Drug" elementFormDefault="qualified">

                        <element name="drugRequest" type="apins:drugRequest"></element>

                        <element name="drugResponse" type="apins:drugResponse"></element>

                        <complexType name="drugRequest">
                            <sequence>
                                <element name="id" type="int"></element>
                            </sequence>
                        </complexType>

                        <complexType name="drugResponse">
                            <sequence>
                                <element name="id" type="int"></element>
                                <element name="drugClass" type="string"></element>
                                <element name="drugName" type="string"></element>
                            </sequence>
                        </complexType>
                    </schema>
                    """;

// This document has 3 errors in it based on the schema above:
// 1) idx instead of id
// 2) dugClass instead of drugClass
// 3) dugName instead of drugName
private String badDrugResponseXml = """
                    <?xml version="1.0" encoding="UTF-8"?>
                    <apins:drugResponse xmlns:apins="https://www.company.com/Drug" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.company.com/Drug Drug.xsd ">
                      <apins:idx>1</apins:idx>
                      <apins:dugClass>opioid</apins:dugClass>
                      <apins:dugName>Buprenorphine</apins:dugName>
                    </apins:drugResponse>
                    """;

/**
 * This test does nothing but send the desired files into the validation
 * process.  The goal is for the validation process to output 3 errors.
 * For reasons I don't understand, it will only output the first one and
 * stop the processing.
 */
@Test
void testWithValidator() {
    System.out.println("Test an entry with multiple errors: " + validateXMLSchema(drugValidationSchema, badDrugResponseXml));
    Assertions.assertTrue(true);
}


/**
 * This validator process seems to always stop after the first error is encountered.
 *
 * @param xsdPath   the actual XSD content as String
 * @param xmlPath   the actual xml document text as String.
 * @return          True if there are no errors, false otherwise. (planning to return details)
 */
static boolean validateXMLSchema(String xsdPath, String xmlPath){

    try {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);

        Schema schema = factory.newSchema(new StreamSource(new StringReader(xsdPath)));

        Validator validator = schema.newValidator();

        List<Exception> exceptions = new ArrayList<>();
        // Add a custom handler to the validator.  The goal is to continue processing
        // the document so that ALL errors are captured.
        validator.setErrorHandler(new ErrorHandler() {
            @Override
            public void warning(SAXParseException exception) {
                exceptions.add(exception);
            }

            @Override
            public void fatalError(SAXParseException exception) {
                exceptions.add(exception);
            }

            @Override
            public void error(SAXParseException exception) {
                exceptions.add(exception);
            }
        });

        validator.validate(new StreamSource(new StringReader(xmlPath)));

        if (exceptions.isEmpty()) {
            System.out.println("No errors.");
        } else {
            for (Exception ex : exceptions) {
                System.out.println("Error found: " + ex.getMessage());
            }
        }
        // Return false when the handler collected any problem, as the Javadoc promises.
        return exceptions.isEmpty();
    } catch (SAXException | IOException e) {
        System.out.println("Exception: " + e.getMessage());
        return false;
    }
}

As the comments suggest, stepping through in the debugger shows that the custom error handler receives the first error, but validation does not go on to report the other two.


Solution

  • The answer is not straightforward, but bear with me...

    I worked with a team that implemented a fully compliant validating XML parser, and I asked them for this exact feature. They explained that an incorrect or unexpected tag name (to the parser these are the same thing) can result from two situations:

    a) an incorrect tag name in the correct position according to the XSD

    b) a correct tag name in an incorrect position according to the XSD

    When people ask for this feature, they are almost always thinking of scenario a). Their XSD is very simple (it allows very little variability in the XML document), so it is 'obvious' to a human reader that the unexpected tag name is a typo. Unfortunately, the XSD specification allows for many kinds of variability: xs:any wildcards, choice groups, unordered groups, optional elements, complex type extensions with various kinds of restrictions, and so on. If the XSD is very 'open', it is not at all obvious that the unexpected tag name was a simple typo. Attempting to continue is pointless in the general case because the parser no longer knows which position in the content model to resume from.
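
    To make the ambiguity between a) and b) concrete, here is a minimal sketch that validates one document with a typo and one with correctly named but misplaced elements. The class name NameMismatchDemo, the d: prefix, and the trimmed-down schema are my own assumptions for illustration. On the validator bundled with the JDK, both documents should come back with the same kind of complaint (cvc-complex-type.2.4.a, "invalid content was found starting with element ..."), so the error alone does not say which scenario occurred.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class NameMismatchDemo {
    static final String NS = "https://www.company.com/Drug";

    // Trimmed-down variant of the schema from the question (hypothetical).
    static final String XSD =
        "<schema xmlns='http://www.w3.org/2001/XMLSchema' targetNamespace='" + NS + "'"
        + " xmlns:apins='" + NS + "' elementFormDefault='qualified'>"
        + "<element name='drugResponse'><complexType><sequence>"
        + "<element name='id' type='int'/>"
        + "<element name='drugName' type='string'/>"
        + "</sequence></complexType></element></schema>";

    // Scenario a): wrong name (idx) in the right position.
    static final String TYPO = "<d:drugResponse xmlns:d='" + NS + "'>"
        + "<d:idx>1</d:idx><d:drugName>Buprenorphine</d:drugName></d:drugResponse>";

    // Scenario b): right names, wrong position (drugName before id).
    static final String MISPLACED = "<d:drugResponse xmlns:d='" + NS + "'>"
        + "<d:drugName>Buprenorphine</d:drugName><d:id>1</d:id></d:drugResponse>";

    // Validates xml against XSD and returns every message the handler received.
    static List<String> errors(String xml) {
        List<String> messages = new ArrayList<>();
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // Collect instead of throwing, exactly as in the question.
            validator.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { messages.add(e.getMessage()); }
                @Override public void error(SAXParseException e) { messages.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) { messages.add(e.getMessage()); }
            });
            validator.validate(new StreamSource(new StringReader(xml)));
        } catch (SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return messages;
    }

    public static void main(String[] args) {
        System.out.println("typo:      " + errors(TYPO));
        System.out.println("misplaced: " + errors(MISPLACED));
    }
}
```

    Comparing the two printed lists side by side is the point: from the validator's side a typo and a misplaced element look alike, which is why it cannot safely guess where to resume.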

    There is only one situation in which an XML processor can issue a validation error and safely continue parsing in all circumstances: when the simple value of an element or attribute does not comply with its simple type or that type's facet restrictions. In that case it is OK to report the error and continue, because the parser has not lost its 'context' within the XSD: the names of the elements have all been matched successfully.
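
    As a rough sketch of that one safe case, the following program keeps every element name correct but gives two elements invalid values. The class name FacetErrorDemo, the maxLength facet on drugName, and the sample values are assumptions I added for illustration; they are not part of the question's schema. With a collecting ErrorHandler, the JDK's bundled validator should report problems for both values in a single pass.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class FacetErrorDemo {
    static final String NS = "https://www.company.com/Drug";

    // Hypothetical schema: id must be an int, drugName at most 5 characters.
    static final String XSD =
        "<schema xmlns='http://www.w3.org/2001/XMLSchema' targetNamespace='" + NS + "'"
        + " elementFormDefault='qualified'>"
        + "<element name='drugResponse'><complexType><sequence>"
        + "<element name='id' type='int'/>"
        + "<element name='drugName'><simpleType><restriction base='string'>"
        + "<maxLength value='5'/></restriction></simpleType></element>"
        + "</sequence></complexType></element></schema>";

    // Element names and order are correct; only the two values are invalid.
    static final String BAD_VALUES = "<d:drugResponse xmlns:d='" + NS + "'>"
        + "<d:id>abc</d:id><d:drugName>Buprenorphine</d:drugName></d:drugResponse>";

    // Validates xml against XSD and returns every message the handler received.
    static List<String> errors(String xml) {
        List<String> messages = new ArrayList<>();
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // Record each problem and return normally so validation can go on.
            validator.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { messages.add(e.getMessage()); }
                @Override public void error(SAXParseException e) { messages.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) { messages.add(e.getMessage()); }
            });
            validator.validate(new StreamSource(new StringReader(xml)));
        } catch (SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return messages;
    }

    public static void main(String[] args) {
        for (String message : errors(BAD_VALUES)) {
            System.out.println("Error found: " + message);
        }
    }
}
```

    The handler here is the same kind as the one in the question. The difference in behaviour comes only from the type of error: a bad value leaves the validator's position in the content model intact, so it can move on to the next element.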

    You may be tempted to refer to your example and say 'but in my case, parsing could safely continue'. You would be correct, but I don't know of any XML parser that has managed to distinguish between 'safe to continue' and 'unsafe to continue' situations for unmatched tag names.