c++ xml xsd xerces-c

Xerces-C++ XMLString::patternMatch() not functioning correctly


I'm trying to find a way to match strings in C++ against XML Schema regular expressions. The XML Schema regex grammar is not supported by std::regex so I installed the Xerces-C++ XML library to use its pattern matching functionality. Unfortunately even with a basic example, it doesn't seem to work right.

#include <iostream>
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLException.hpp>
#include <xercesc/util/XMLString.hpp>

using namespace XERCES_CPP_NAMESPACE;

int main()
{
    try
    {
        XMLPlatformUtils::Initialize();
    }
    catch (const XMLException& ex)
    {
        char* message = XMLString::transcode(ex.getMessage());
        std::cerr << "Error during Xerces-c Initialization.\n"
            << "  Exception message:"
            << message;
        XMLString::release(&message);
        return 1;
    }

    // transcode() allocates its result; keep the pointers non-const so they can be released
    XMLCh* str = XMLString::transcode("bcdfg");

    // A simple regex that uses "character class subtraction"
    // Should match any string made up only of lowercase letters that are not vowels
    XMLCh* pattern = XMLString::transcode("[a-z-[aeiou]]+");

    if (XMLString::patternMatch(str, pattern) != -1)
    {
        std::cout << "Match!" << std::endl;
    }
    else
    {
        std::cout << "No match." << std::endl;
    }

    XMLString::release(&str);
    XMLString::release(&pattern);
    XMLPlatformUtils::Terminate();
    return 0;
}

Output: No match.

If I write a very simple regex that doesn't use character class subtraction, it does seem to work. However, I need character class subtraction to work, because I have to support any possible regex that conforms to the XML Schema regex grammar.
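For reference, this one pattern can be emulated by hand in std::regex, which has no subtraction syntax in its default ECMAScript grammar. The sketch below rewrites "[a-z] minus [aeiou]" as a negative lookahead followed by the base class. It is only a hand translation of a single pattern, not a general XML Schema regex engine, and `matchesLowercaseConsonants` is a hypothetical helper name, not part of any library.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of Xerces-C++ or the standard library):
// true when the whole input is one or more lowercase ASCII letters that
// are not vowels, which is what the XSD pattern "[a-z-[aeiou]]+" describes.
bool matchesLowercaseConsonants(const std::string& input)
{
    // ECMAScript has no class subtraction, so each character is checked as
    // "not a vowel" (negative lookahead) and then "in a-z".
    static const std::regex pattern("(?:(?![aeiou])[a-z])+");

    // regex_match requires the entire string to match, as an xs:pattern facet does.
    return std::regex_match(input, pattern);
}
```

Calling `matchesLowercaseConsonants("bcdfg")` returns true, while `"bcdefg"` and the empty string are rejected. This does not scale to arbitrary schema patterns, which is why a library with real XML Schema regex support is still needed.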

The documentation for Xerces is very unclear and doesn't specify which regex grammar is used by this function, but I was assuming since it is an XML parsing library it would implement XML regular expressions. Perhaps that assumption was wrong?

EDIT:

Adding an example of an actual regex from an XSD file that I will need to support. This example comes from the schema that defines the basic datatypes supported by XML Schemas. The specification can be found here: https://www.w3.org/TR/xmlschema-2/#conformance

An example of a regular expression I will need to support, which uses character class subtraction as well as the special \c and \i character groups, is shown in the xs:pattern restriction for the "NCName" datatype below:

  <xs:simpleType name="NCName" id="NCName">
    <xs:annotation>
      <xs:documentation source="http://www.w3.org/TR/xmlschema-2/#NCName"/>
    </xs:annotation>
    <xs:restriction base="xs:Name">
      <xs:pattern value="[\i-[:]][\c-[:]]*" id="NCName.pattern">
        <xs:annotation>
          <xs:documentation
               source="http://www.w3.org/TR/REC-xml-names/#NT-NCName">
            pattern matches production 4 from the Namespaces in XML spec
          </xs:documentation>
        </xs:annotation>
      </xs:pattern>
    </xs:restriction>
  </xs:simpleType>
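To make the intent of that pattern concrete, here is a rough hand-written sketch of what `[\i-[:]][\c-[:]]*` accepts, restricted to ASCII input. The real \i and \c classes also cover many non-ASCII letters, combining characters and extenders, so this is an assumption-laden approximation and not a conforming NCName check. All function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// ASCII-only stand-in for [\i-[:]]: a letter or '_' (':' has been subtracted).
bool isAsciiNCNameStart(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
}

// ASCII-only stand-in for [\c-[:]]: start characters plus digits, '.' and '-'.
bool isAsciiNCNameChar(char c)
{
    return isAsciiNCNameStart(c) || (c >= '0' && c <= '9') || c == '.' || c == '-';
}

// One start character followed by zero or more name characters.
bool looksLikeAsciiNCName(const std::string& s)
{
    if (s.empty() || !isAsciiNCNameStart(s.front()))
        return false;
    for (std::size_t i = 1; i < s.size(); ++i)
    {
        if (!isAsciiNCNameChar(s[i]))
            return false;
    }
    return true;
}
```

With this sketch, `looksLikeAsciiNCName("_a-b.c1")` is accepted, while `"xs:Name"` (contains the subtracted ':') and `"1abc"` (starts with a digit) are rejected.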

Solution

  • Okay so I wasn't able to get the Xerces regular expressions to work, and the documentation was nothing short of abysmal, so I decided to try out another library. libxml2 has XML regular expressions and although the documentation for the regex feature was similarly abysmal, I was able to get a working program.

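    #include <iostream>
    #include <libxml/xmlmemory.h>
    #include <libxml/xmlregexp.h>
    #include <libxml/xmlstring.h>
    
    int main()
    {
        LIBXML_TEST_VERSION;
    
        xmlChar* str = xmlCharStrdup("bcdfg");
        xmlChar* pattern = xmlCharStrdup("[a-z-[aeiou]]+");
        xmlRegexp* regex = xmlRegexpCompile(pattern);
    
        // xmlRegexpExec returns 1 on a match, 0 on no match, and a negative value on error
        if (regex != nullptr && xmlRegexpExec(regex, str) == 1)
        {
            std::cout << "Match!" << std::endl;
        }
    
        // Release with libxml2's own deallocators rather than free()
        xmlRegFreeRegexp(regex);
        xmlFree(pattern);
        xmlFree(str);
    }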
    

    Output:

    Match!

    I figured even though it does not answer how to get regular expressions to work properly with Xerces, this answer may help others who are looking to solve the same problem of getting XML Schema regular expressions to work in C++.