XML regular expressions syntax


I'm trying to write my own regexp parser, so I have been reading the related W3C documents. The standard document XML Schema Part 2: Datatypes Second Edition gives the following definition of a normal character (with the well-known bug of omitting curly braces):

A normal character is any XML character that is not a metacharacter. (...)

[10] Char ::= [^.\?*+()|#x5B#x5D]

Then the following note appears:

Note that a ·normal character· can be represented either as itself, or with a character reference. http://www.w3.org/TR/2000/WD-xml-2e-20000814#dt-charref

I'm not very fluent in English and I am not sure how to understand that. If the authors specifically emphasize that ·normal characters· can be represented with character references, then I would expect that such a representation is not allowed for metacharacters. Am I right about this?

And if I am, what are the implications when a character reference specifies the code point of a metacharacter, say an asterisk, as in a&#42;?

  1. Is this expression simply invalid?
  2. Or does the reference implicitly become a normal character, making the expression equivalent to a\* (with the asterisk escaped)?
  3. Something else?

All the examples I have found with Google use character references to put metacharacters into the chargroups of character class expressions. However, the Char symbol appears in production 9 of the regexp syntax, as one of the three alternatives for an Atom, and neither Atom nor Char itself is used to define any kind of chargroup; XmlChar is used instead, and it has no note attached about the use of character references.

Please clarify the mess in my head:


Solution

  • Does a metacharacter specified with a character reference become a normal character? How should a&#42; work?

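    No. The XML parser resolves the character reference before the regex processor sees the pattern, so a&#42; becomes a*, and * is still a metacharacter (a quantifier). To match a literal asterisk, escape it as \*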
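
A minimal sketch of that behaviour, assuming the pattern is stored in the value attribute of an xs:pattern facet in a schema document. Python's xml.etree.ElementTree stands in for the XML parser and Python's re module stands in for an XSD regex engine; that substitution is an assumption, since the two regex dialects differ in general, although they agree on these two simple patterns. The helper name pattern_value is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

def pattern_value(facet_xml: str) -> str:
    """Return the value attribute of an xs:pattern facet after XML parsing."""
    return ET.fromstring(facet_xml).get("value")

# The XML parser resolves &#42; to '*' before any regex engine sees the value.
unescaped = pattern_value(f'<xs:pattern xmlns:xs="{XS}" value="a&#42;"/>')
# Here the attribute value is the three characters a, backslash, asterisk.
escaped = pattern_value(f'<xs:pattern xmlns:xs="{XS}" value="a\\*"/>')

# XSD patterns are implicitly anchored, so fullmatch is the closest analogue.
matches_quantifier = re.fullmatch(unescaped, "aaa") is not None
matches_literal = re.fullmatch(escaped, "a*") is not None
```

Under these assumptions the first pattern reaches the regex engine as a* (zero or more "a"), while only the backslash-escaped form matches the literal string "a*".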

    Coming to the next question:

    From http://msdn.microsoft.com/en-us/library/ms256185.aspx

    charRange ::= seRange | XmlCharRef | XmlCharIncDash
    

    where

    XmlCharRef ::= ( '&#' [0-9]+ ';' ) | ('&#x' [0-9a-fA-F]+ ';' )
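
As a rough sketch, the XmlCharRef production above can be transcribed into a Python regular expression. The names XML_CHAR_REF and decode_char_ref are hypothetical and appear in neither grammar; the code only recognises the syntax of a single reference and returns the character it denotes.

```python
import re

# XmlCharRef ::= ( '&#' [0-9]+ ';' ) | ( '&#x' [0-9a-fA-F]+ ';' )
XML_CHAR_REF = re.compile(r"&#(?:([0-9]+)|x([0-9a-fA-F]+));")

def decode_char_ref(text: str) -> str:
    """Return the character denoted by a single character reference.

    Raises ValueError if text is not exactly one XmlCharRef.
    """
    m = XML_CHAR_REF.fullmatch(text)
    if m is None:
        raise ValueError(f"not a character reference: {text!r}")
    decimal, hexadecimal = m.groups()
    # Exactly one of the two groups is set, depending on the alternative taken.
    code_point = int(decimal) if decimal is not None else int(hexadecimal, 16)
    return chr(code_point)
```

With this sketch, decode_char_ref("&#42;") and decode_char_ref("&#x2A;") both return "*".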
    

    But in the W3C grammar,

    charRange ::= seRange | XmlCharIncDash
    

    XmlCharRef is not included. So, to the question:

    Is a character reference valid between [ and ] (inside character class expressions, http://www.w3.org/TR/xmlschema-2/#dt-charexpr)?

    No, not according to the W3C grammar.