
Xerces2-J error location accuracy: library vs Oxygen Editor


I am trying to determine whether Oxygen XML Editor uses a customized version of the Xerces-J parser, or whether there is a feature of the Xerces-J library I am not aware of. The discrepancy is in the location of a validation error: Oxygen is spot-on, while Xerces-J gives a less precise location.

Example XSD:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="shiporder">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="orderperson" type="xs:string"/>
        <xs:element name="shipto">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="address" type="xs:string"/>
              <xs:element name="city" type="xs:string"/>
              <xs:element name="country" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="item" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="title" type="xs:string"/>
              <xs:element name="note" type="xs:string" minOccurs="0"/>
              <xs:element name="quantity" type="xs:positiveInteger"/>
              <xs:element name="price" type="xs:decimal"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="orderid" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

XML Document:

<?xml version="1.0" encoding="UTF-8"?>
<shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" orderid="889923" xsi:noNamespaceSchemaLocation="shiporder.xsd">
  <orderperson>John Smith</orderperson>
  <shipto>--- Bogus Text ---
    <name>Ola Nordmann</name>
    <address>Langgt 23</address>
    <city>4000 Stavanger</city>
    <country>Norway</country>
  </shipto>
  <item>
    <title>Empire Burlesque</title>
    <note>Special Edition</note>
    <quantity>1</quantity>
    <price>10.90</price>
  </item>
  <item>
    <title>Hide your heart</title>
    <quantity>1</quantity>
    <price>9.90</price>
  </item>
</shiporder>

When I run schema validation using the jaxp.SourceValidator sample from the Xerces-J 2.12.2 distribution, I get the following:

[Error] doc.xml:9:12: cvc-complex-type.2.3: Element 'shipto' cannot have character [children], because the type's content type is element-only.

The type of failure is correct, but the location is on the end tag of 'shipto', line 9, column 12.

When I load the same document in Oxygen Editor, I get the following location information (copied from their error dialog):

Start location: line: 4, column: 11
End location:   line: 4, column: 29

Oxygen identifies the engine as Xerces. The location information is spot-on, and I would like to get the same information from the Xerces-J library. Is this possible, or is Oxygen using a customized version of "Xerces", so that such granular location information is not available from the stock Xerces-J library?


Solution

  • Oxygen uses a customized, patched version of Xerces, and you can't easily get the same pinpoint error location from the standard Xerces-J library. The precise error location is a result of Oxygen's own architecture, which builds a richer, location-aware model of the document than a standard parser does by default.

    With the standard SAX-based pipeline, Xerces reads your file as a stream. It can only confirm that the <shipto> element is invalid when it reaches the end tag </shipto>, so it reports the error at its current location: the end tag (line 9, column 12).
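
    To see this behavior in isolation, here is a minimal sketch using only the JAXP validation API with a custom ErrorHandler. The class name EndTagLocationDemo and the trimmed-down schema and document are made up for illustration; they are not the files from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

public class EndTagLocationDemo {
    // Reduced, hypothetical schema: <shipto> has element-only content.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='shipto'><xs:complexType><xs:sequence>"
      + "<xs:element name='name' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // The stray text is on line 1; the end tag is on line 3.
    static final String XML =
        "<shipto>--- Bogus Text ---\n"
      + "  <name>Ola Nordmann</name>\n"
      + "</shipto>";

    /** Validates XML against XSD as a stream and collects every reported error. */
    static List<SAXParseException> validate() throws Exception {
        List<SAXParseException> errors = new ArrayList<>();
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        Validator validator = schema.newValidator();
        validator.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { errors.add(e); }
            @Override public void error(SAXParseException e) { errors.add(e); }
            @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        validator.validate(new StreamSource(new StringReader(XML)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        for (SAXParseException e : validate()) {
            // Line and column come from the parser's position when the error was raised.
            System.out.println(e.getLineNumber() + ":" + e.getColumnNumber() + ": " + e.getMessage());
        }
    }
}
```

    On this input the error is reported on the line holding </shipto> (line 3), not on line 1 where the stray text actually sits.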

    Oxygen, however, first builds its own, much more detailed internal data model of your document, recording the exact start and end position of every element. This internal model is far richer than a standard DOM tree. Oxygen catches the SAXParseException from Xerces and, instead of just printing its limited Locator data, uses its own pre-built map to find and show the precise location of the illegal text (it knows the text starts at line 4, column 11 and ends at line 4, column 29).

    Replicating Oxygen's behavior with stock Xerces

    This requires more work than simply calling a validator. You'd essentially have to use Oxygen's approach, but on a smaller scale.

    The standard ErrorHandler interface in JAXP gives you a SAXParseException, which provides the getLineNumber() and getColumnNumber() methods. But this is not enough for this type of error. To get more precise locations you'd need to:

    1. Build a location-aware DOM: first parse the XML into a DOM tree. While parsing, attach location information to each DOM node using the setUserData method. You can create a simple LocationData class to store the start/end line and column numbers.

    2. Validate the DOM: once you have your location-annotated DOM tree, validate it using a javax.xml.validation.Validator and a javax.xml.transform.dom.DOMSource.

    3. Correlate errors: when your custom ErrorHandler receives a SAXParseException, the error message will still tell you which element is at fault ("shipto"). You can then find that node in your annotated DOM tree and retrieve the precise location data for the offending child (in this case, the Node.TEXT_NODE child of the <shipto> element).
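
    The three steps above can be sketched roughly as follows. This is a minimal illustration under several assumptions, not Oxygen's actual implementation: the class name LocatedDomValidation, the "loc" user-data key, and the reduced schema and document are all made up; an int[] stands in for the LocationData class to keep the sketch short; a node's start position is taken to be wherever the SAX Locator stood at the end of the previous event; and the faulty element is looked up by parsing its name out of the (English) error message, which picks the first element with that name.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class LocatedDomValidation {
    // Hypothetical user-data key; the value is int[]{startLine, startCol, endLine, endCol}.
    static final String LOC = "loc";
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'><xs:element name='order'>"
      + "<xs:complexType><xs:sequence><xs:element name='shipto'><xs:complexType><xs:sequence>"
      + "<xs:element name='name' type='xs:string'/></xs:sequence></xs:complexType></xs:element>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";
    static final String XML =
        "<order>\n  <shipto>--- Bogus Text ---\n    <name>Ola Nordmann</name>\n"
      + "  </shipto>\n</order>";

    /** Step 1: build a DOM whose nodes carry the span between consecutive SAX events. */
    static Document parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        spf.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            private final Deque<Node> open = new ArrayDeque<>();
            private Locator locator;
            private int line = 1, col = 1; // where the previous event ended

            @Override public void setDocumentLocator(Locator l) { locator = l; }
            @Override public void startDocument() { open.push(doc); }
            @Override public void startElement(String uri, String local, String qName, Attributes a) {
                Element e = doc.createElementNS(uri.isEmpty() ? null : uri, qName);
                for (int i = 0; i < a.getLength(); i++) {
                    e.setAttributeNS(a.getURI(i).isEmpty() ? null : a.getURI(i), a.getQName(i), a.getValue(i));
                }
                attach(e); // for elements the recorded span covers the start tag only
                open.push(e);
            }
            @Override public void endElement(String uri, String local, String qName) { open.pop(); mark(); }
            @Override public void characters(char[] ch, int start, int length) {
                attach(doc.createTextNode(new String(ch, start, length))); // one text node per chunk
            }
            private void attach(Node n) {
                n.setUserData(LOC, new int[] {line, col, locator.getLineNumber(), locator.getColumnNumber()}, null);
                open.peek().appendChild(n);
                mark();
            }
            private void mark() { line = locator.getLineNumber(); col = locator.getColumnNumber(); }
        });
        return doc;
    }

    /** Steps 2 and 3: validate the DOM, then map each error to the first stray text child. */
    static List<String> validate(Document doc, String xsd) throws Exception {
        List<String> report = new ArrayList<>();
        Validator v = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(xsd))).newValidator();
        v.setErrorHandler(new DefaultHandler() {
            @Override public void error(SAXParseException e) {
                // Assumption: the message names the element, e.g. "Element 'shipto' cannot have ...".
                Matcher m = Pattern.compile("Element '([^']+)'").matcher(String.valueOf(e.getMessage()));
                Node el = m.find() ? doc.getElementsByTagName(m.group(1)).item(0) : null;
                for (Node c = el == null ? null : el.getFirstChild(); c != null; c = c.getNextSibling()) {
                    if (c.getNodeType() == Node.TEXT_NODE && !c.getNodeValue().trim().isEmpty()) {
                        int[] p = (int[]) c.getUserData(LOC);
                        report.add(p[0] + ":" + p[1] + "-" + p[2] + ":" + p[3] + " " + e.getMessage());
                        return;
                    }
                }
                report.add("?:? " + e.getMessage()); // other error kinds need their own mapping
            }
        });
        v.validate(new DOMSource(doc));
        return report;
    }

    public static void main(String[] args) throws Exception {
        validate(parse(XML), XSD).forEach(System.out::println);
    }
}
```

    For this input the reported span starts on line 2, where the stray text begins, rather than on line 4 where </shipto> closes. Exact columns depend on how the parser advances its Locator and how it splits character data into chunks, so treat them as approximate. Comments and processing instructions are ignored here, which can shift the start of a following node. If several elements share a name, the message-based lookup is ambiguous; as far as I know, Xerces also exposes the element being validated through the validator property http://apache.org/xml/properties/dom/current-element-node, which would be a more reliable way to find it.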