
How to construct a robust XML parser with error handling in XPath 3.1 and XSLT


My use case: I want to analyze a large XML document that contains elements named ownedComment. Each of these elements has an attribute called body. The content of this attribute should be a string that is a serialized XML document fragment. An example would be:

<ownedComment body="&lt;p>This is a &lt;i>comment&lt;/i>&lt;/p>"/>

An additional complication is that the serialized document references XML entities that are defined externally elsewhere (in a file).

I am using XSLT 3.0 and XPath 3.1; the environment is Saxon EE within Oxygen. I have successfully created a function called uml:documentation-parser that creates a parser for a particular entity definition file. It uses the closure technique (for the entity definition) and higher-order functions, since it returns a function with the signature function (element()) as element()*. The semantics of the returned function are: for a given element $e, return the content of $e/ownedComment/@body parsed as an XML document, taking into account the entity definitions from a particular file. The outline of this function is given below:

    <xsl:function name="uml:documentation-parser" as="function (element()) as element()*">
        <xsl:param name="entity-file" as="xs:string?"/>
        <xsl:sequence select="
                let $doctype := if ($entity-file) then
                    '&lt;!DOCTYPE root [&lt;!ENTITY % entities SYSTEM ''' || $entity-file || '''&gt;%entities;]&gt;'
                else
                    '',
                    $prolog := '&lt;root xmlns=''http://docbook.org/ns/docbook''&gt;',
                    $epilog := '&lt;/root&gt;'
                return
                    function ($element as element(*)) as element()* {
                        let $text := $element/ownedComment/@body
                        return
                            if ($text) then
                                (concat($doctype, $prolog, $text, $epilog) => parse-xml())/*/*
                            else
                                ()
                    }"/>
    </xsl:function>

$prolog and $epilog are needed because the document fragment in @body may contain more than one serialized XML element. They guarantee that there is always a single root element, and they set the namespace.
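
For illustration, the returned parser is meant to be used roughly as in the sketch below. The entity file URI and the element name packagedElement are hypothetical placeholders, and the fragment assumes it sits in the same stylesheet as uml:documentation-parser.

```xml
<!-- Sketch: fragment to place in the same stylesheet as uml:documentation-parser. -->
<!-- The file URI and the element name packagedElement are hypothetical placeholders. -->
<xsl:variable name="parse-documentation" as="function (element()) as element()*"
  select="uml:documentation-parser('file:///path/to/entities.ent')"/>

<xsl:template match="packagedElement">
  <documentation>
    <!-- Calls the closure; the entity file is already bound inside it. -->
    <xsl:sequence select="$parse-documentation(.)"/>
  </documentation>
</xsl:template>
```

Creating the parser once in a global variable means the DOCTYPE string is built a single time and reused for every element.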

This works well when the string in @body can be parsed as an XML document. But the parse-xml() function raises the dynamic error err:FODC0006 if the content is not a well-formed and namespace-well-formed XML document.

That's why I would like to change the signature of the returned function (the parser) to function (element()) as map(*). The idea is that the parser should never raise an error but always return a map that reports the outcome, including an entry for the error if parsing failed.

My problem is that there is no try/catch mechanism in XPath; it is a feature of XSLT.

My question is: is there any way, using a combination of XPath and XSLT, to construct a robust XML parser as the result of a higher-order function that is able to catch dynamic errors?

Thanks in advance, Frank Steimke


Solution

  • I've written a stylesheet based on your code and added an auxiliary function, uml:parse-xml-robustly, which returns a map as you specified. I changed your existing function so that it calls this new function in place of parse-xml() and extracts the parsed XML (if any) from the returned map.

    It wasn't entirely clear what you wanted to do about errors. You said you wanted your robust parser to return a map whose error key would be associated with the error value. So I chose to return an error in the form of another map, with keys code, description, and value, all with string values.

    If the map returned by uml:parse-xml-robustly() doesn't contain the parsed XML, then I use another auxiliary function to return the error map from that map in the form of an element (because the function is declared to return an element).

    As a test I added a template to match an element called element and invoke the function.

    <xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:uml="https://example.com/uml"
      xmlns:xs="http://www.w3.org/2001/XMLSchema"
      xmlns:err="http://www.w3.org/2005/xqt-errors"
      xmlns:map="http://www.w3.org/2005/xpath-functions/map"
      exclude-result-prefixes="#all">
      
      <xsl:output method="xml" indent="yes"/>
      <xsl:mode on-no-match="shallow-copy"/>
      
      <xsl:template match="element">
        <xsl:variable name="parser" select="uml:documentation-parser(())"/>
        <xsl:copy>
          <xsl:sequence select="$parser(.)"/>
        </xsl:copy>
      </xsl:template>
      
      <xsl:function name="uml:parse-xml-robustly" as="map(*)">
        <xsl:param name="text" as="xs:string"/>
        <xsl:try>
          <xsl:sequence select="
            map{
              'text': $text,
              'xml': parse-xml($text),
              'error': ()
            }
          "/>
          <xsl:catch select="
            map{
              'text': $text,
              'xml': (),
              'error': map{
                'code': 'Q{' || namespace-uri-from-QName($err:code) || '}' || local-name-from-QName($err:code),
                'description': $err:description,
                'value': $err:value
              }
            }
          "/>
        </xsl:try>
      </xsl:function>
      
      <xsl:function name="uml:error-as-element" as="element(error)">
        <xsl:param name="error" as="map(*)"/>
        <error>
          <xsl:for-each select="map:keys($error)">
            <xsl:attribute name="{.}" select="$error(.)"/>
          </xsl:for-each>
        </error>
      </xsl:function>
      
      <xsl:function name="uml:documentation-parser" as="function (element()) as element()*">
        <xsl:param name="entity-file" as="xs:string?"/>
        <xsl:sequence select="
          let $doctype := if ($entity-file) then
              '&lt;!DOCTYPE root [&lt;!ENTITY % entities SYSTEM ''' || $entity-file || '''&gt;%entities;]&gt;'
          else
              '',
              $prolog := '&lt;root xmlns=''http://docbook.org/ns/docbook''&gt;',
              $epilog := '&lt;/root&gt;'
          return
              function ($element as element(*)) as element()* {
                  let $text := $element/ownedComment/@body
                  return
                      if ($text) then
                        let $result := 
                          concat($doctype, $prolog, $text, $epilog) => uml:parse-xml-robustly()
                        return
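                          (: return every parsed element, or the error element if parsing failed :)
                          if (exists($result('xml'))) then $result('xml')/*/* else $result('error') ! uml:error-as-element(.)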
                      else
                          ()
              }"/>
      </xsl:function>
    </xsl:stylesheet>
    
    

    Test document:

    <root>
      <element>
        <ownedComment body="&lt;p>This is a &lt;i>comment&lt;/i>&lt;/p>"/>
      </element>
      <element>
        <ownedComment body="&lt;p>This is a &lt;i>comment&lt;/i>&lt;p>"/>
      </element>
    </root>
    

    Result:

    <root>
       <element>
          <p xmlns="http://docbook.org/ns/docbook">This is a <i>comment</i>
          </p>
       </element>
       <element>
          <error code="Q{http://www.w3.org/2005/xqt-errors}FODC0006"
                 value="org.xml.sax.SAXParseException; systemId: urn:from-string; lineNumber: 1; columnNumber: 77; The element type &#34;p&#34; must be terminated by the matching end-tag &#34;&lt;/p&gt;&#34;."
                 description="First argument to parse-xml() is not a well-formed and namespace-well-formed XML document. org.xml.sax.SAXParseException; systemId: urn:from-string; lineNumber: 1; columnNumber: 77; The element type &#34;p&#34; must be terminated by the matching end-tag &#34;&lt;/p&gt;&#34;.The element type &#34;p&#34; must be terminated by the matching end-tag &#34;&lt;/p&gt;&#34;."/>
       </element>
    </root>
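
    If the parser itself should return the map, as in the signature function (element()) as map(*) from the question, a variant might look like the sketch below. The name uml:documentation-map-parser and the all-empty map returned for a missing @body are my assumptions, not part of the stylesheet above. The fragment relies on uml:parse-xml-robustly from that stylesheet, and the template would replace the test template for element.

    ```xml
    <!-- Sketch only: uml:documentation-map-parser is a hypothetical name. -->
    <xsl:function name="uml:documentation-map-parser" as="function (element()) as map(*)">
      <xsl:param name="entity-file" as="xs:string?"/>
      <xsl:sequence select="
        let $doctype := if ($entity-file) then
            '&lt;!DOCTYPE root [&lt;!ENTITY % entities SYSTEM ''' || $entity-file || '''&gt;%entities;]&gt;'
          else
            '',
          $prolog := '&lt;root xmlns=''http://docbook.org/ns/docbook''&gt;',
          $epilog := '&lt;/root&gt;'
        return
          function ($element as element(*)) as map(*) {
            let $text := $element/ownedComment/@body
            return
              if ($text) then
                concat($doctype, $prolog, $text, $epilog) => uml:parse-xml-robustly()
              else
                (: assumption: no @body yields a map with all entries empty :)
                map{'text': (), 'xml': (), 'error': ()}
          }"/>
    </xsl:function>

    <!-- Replaces the test template above; the caller decides how to treat errors. -->
    <xsl:template match="element">
      <xsl:variable name="result" select="uml:documentation-map-parser(())(.)"/>
      <xsl:copy>
        <xsl:choose>
          <xsl:when test="exists($result?error)">
            <xsl:message select="'Cannot parse @body: ' || $result?error?description"/>
          </xsl:when>
          <xsl:otherwise>
            <!-- The map holds the whole parsed document, so step past the wrapper root. -->
            <xsl:sequence select="$result?xml/*/*"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:copy>
    </xsl:template>
    ```

    With this shape, the decision about what to do with a parse failure (log it, emit an error element, skip the comment) moves from the parser to the calling template.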