Tags: php, xml, xsd, custom-data-attribute, relaxng

How to validate documents with HTML data-* attributes?


I need to validate XML with dynamic attribute names, such as data-*. I'm currently using a RelaxNG schema, but it does not support dynamic attribute names. What are my options? I cannot find anything relevant.

Example of XML:

<?xml version="1.0" encoding="utf-8"?>
<body xml:lang="cs" ns="www.x.y">
  <h id="x" ctime="2017-09">Heading..</h>
  <desc kw="kw">Desc..</desc>
  <section>
    <h data-foo="bar" id="one" short="One">First heading</h>
    <desc>Desc...</desc>
    <p>Content..</p>
    <ul data-buz="fuz">
      <li data-switch="click">list item</li>
      <li>list item 2</li>
    </ul>
  </section>
</body>

Solution

  • Preprocess the XML to drop the data-* attributes before passing it to the validation function. I know of no other way to validate it with RelaxNG or other grammar-based schema languages.

    As for preprocessing the XML, one way to do it with an existing XML toolchain is to run it through an XSLT transformation that drops the data-* attributes but passes everything else through as-is:

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
      <xsl:output method="xml" indent="no"/>
      <xsl:template match="node() | @*">
        <xsl:copy>
          <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
      </xsl:template>
      <xsl:template match="@*[starts-with(name(), 'data-')]"/>
    </xsl:stylesheet>
    

    The <xsl:template match="@*[starts-with(name(), 'data-')]"/> template is the important part. It causes any data-* attribute to simply be dropped. Because its pattern carries a predicate, it takes precedence over the generic copy template. The rest of the stylesheet is just a basic “identity transform” that passes everything else in the source XML through as-is.
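As a rough illustration of running a stylesheet like this from code, here is a minimal Java sketch using the JDK's built-in javax.xml.transform API. The class name DataAttributeStripper and its strip method are made up for this example, and the stylesheet is embedded as a string only to keep the sketch self-contained; any XSLT 1.0 processor should be able to do the same job.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper: removes data-* attributes from an XML string via XSLT. */
final class DataAttributeStripper {

    // Identity transform plus one empty template that swallows data-* attributes.
    static final String STYLESHEET =
        "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">"
      + "<xsl:output method=\"xml\" indent=\"no\"/>"
      + "<xsl:template match=\"node() | @*\">"
      + "<xsl:copy><xsl:apply-templates select=\"@* | node()\"/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match=\"@*[starts-with(name(), 'data-')]\"/>"
      + "</xsl:stylesheet>";

    /** Returns the input document serialized again without its data-* attributes. */
    static String strip(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Unchecked wrapper keeps the sketch easy to call; adjust to taste.
            throw new IllegalArgumentException("Could not transform input", e);
        }
    }
}
```

Passing the example document from the question to DataAttributeStripper.strip(...) should give back the same markup without data-foo, data-buz, and data-switch, and that result is what would then be handed to the validator.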

    The W3C Nu Html Checker (HTML5 validator) backend does something for data-* attributes that’s functionally the same as that XSLT transformation, but written in Java. If you’re curious, the code for it is within the GitHub repo for the W3C Nu Html Checker sources, here:

    https://github.com/validator/validator/tree/master/src/nu/validator/xml/dataattributes

    See the filterAttributes code in DataAttributeDroppingContentHandlerWrapper.java

    It’s essentially a SAX filter that operates on parse events at parse time, before they reach the validation function.
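The parse-time approach can be sketched in a few lines with the JDK's standard SAX classes. This is not the Nu Html Checker code, only a minimal illustration that assumes dropping every attribute whose qualified name starts with data- is all that is needed; DataAttributeDroppingFilter and the demo helper attributeNamesSeen are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: forwards all SAX events, minus data-* attributes. */
final class DataAttributeDroppingFilter extends XMLFilterImpl {

    DataAttributeDroppingFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        // Copy every attribute except those named data-*.
        AttributesImpl kept = new AttributesImpl();
        for (int i = 0; i < atts.getLength(); i++) {
            if (!atts.getQName(i).startsWith("data-")) {
                kept.addAttribute(atts.getURI(i), atts.getLocalName(i),
                                  atts.getQName(i), atts.getType(i),
                                  atts.getValue(i));
            }
        }
        super.startElement(uri, localName, qName, kept);
    }

    /** Demo only: lists the attribute names a downstream handler gets to see. */
    static List<String> attributeNamesSeen(String xml) {
        List<String> seen = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader filter = new DataAttributeDroppingFilter(
                factory.newSAXParser().getXMLReader());
            // A real pipeline would plug the validator in here instead.
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    for (int i = 0; i < atts.getLength(); i++) {
                        seen.add(atts.getQName(i));
                    }
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
        return seen;
    }
}
```

The design point is that whatever sits downstream of the filter, whether a validator or the small recording handler used here, never sees the data-* attributes, while the source document itself stays untouched.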

    And if you’re even more curious, there is code for other preprocessing filters doing similar things.

    Anyway, you get the general idea: if your source contains markup constructs that you can’t express validation logic for in RelaxNG or XSD, you filter (preprocess) the source to hide that markup from the validation function.