recursionrelaxngrelaxng-compact

Express a recursive reference in a set of recurring ordered elements


I'm trying to write a RelaxNG schema that has the following rules:

  1. A line element can contain zero or more a and b elements.
  2. Every a element must have a corresponding b element and vice-versa.
  3. a elements must always precede their matching b elements.

So, the following should all be considered valid:

<line><a/><b/></line>

<line><a/><a/><b/><b/></line>

<line><a/><a/><b/><a/><b/><b/></line>

Meanwhile, the following are all invalid:

<line><b/><a/></line>

<line><a/><a/><b/></line>

<line><a/></line>

<line><b/></line>

How can this be expressed in RelaxNG? My first thought was to create a recursive reference, like so:

element line { pair* }+

pair = a, pair?, b

a = element a { empty }
b = element b { empty }

However, Jing considers this a "bad recursive reference to 'pair'". I can't for the life of me figure out how to solve this! Any ideas?


Solution

  • The patterns of Relax NG (like the content models of DTDs and XSD) are essentially regular expressions; they define regular languages over the set of element names and text. The a/b matching you appear to be looking for requires a context-free grammar; it cannot, therefore, be defined in Relax NG.

    It can, of course, be approximated. If you expect that in practice you will never have more than five a/b pairs (or, more precisely: never more than five a elements for which you have not yet seen the matching b), you can define either of the following languages.

    First:

    pair0 = (a, b)*
    pair1 = (a, pair0, b)*
    pair2 = (a, pair1, b)*
    pair3 = (a, pair2, b)*
    pair4 = (a, pair3, b)*
    pair5 = (a, pair4, b)*
    pair6 = (a, pair5, b)*
    pair7 = (a, pair6, b)*
    pair8 = (a, pair7, b)*
    pair9 = (a, pair8, b)*
    pair  = (a, pair9, b)*
    

    This defines the subset of your language which strictly enforces your two rules but cannot handle a/b pairs nesting more than 10 deep. (Or 11, depending on what you're counting.) Every document accepted by this definition will be a member of the language you actually want, but not every member of the language will be accepted.

    If rejecting valid instances of the language is not acceptable, you can define patterns as above, but redefine pair0 as:

    pair0 = (a, (a|b)*, b)*
    

    This defines a superset of the language, in which the rules are enforced for a/b pairs up to the maximum level of nesting, but in which the rules are abandoned once the nesting level exceeds that maximum. In this case, every member of the desired language will be accepted, but so will some garbage that shouldn't be.

    Whether these approximations are any use in your application I leave to you to decide.

    If approximations are not acceptable, you may find it easier to get what you need by defining the XML differently.

    One simple change would be to wrap each matching a and b in an element (which I'll call e):

    element line { pair* }
    a = element a { empty }
    b = element b { empty }
    pair = element e { a, pair*, b }
    

    Now the validity of the document provides a very simple guarantee that the a elements and b elements pair up as required and that each a precedes its matching b.

    Given that the a and b are empty, that each a immediately follows a start-tag for e, and that each b immediately precedes the end-tag for the same e element, you could replace the a and b elements entirely with the start-tags for e and end-tags for e, respectively, and declare line as

    element line { e* }
    e = element e { e* }
    

    SAX interfaces, DOM interfaces, XSLT, XQuery, and every other method I know of handling XML makes it roughly as simple to associate actions with the start and end of an element as it is to associate actions with an empty element.

    Your valid examples then become:

    <line><e></e></line>
    <line><e><e></e></e></line>
    <line><e><e></e><e></e></e></line>
    

    and your invalid examples become non-wellformed data.