I'm trying to write my own regexp parser, so I have been reading the related W3C documents. XML Schema Part 2: Datatypes Second Edition gives the following definition of a normal character (with the well-known bug that curly braces are missing from the list):
A normal character is any XML character that is not a metacharacter. (...)
[10] Char ::= [^.\?*+()|#x5B#x5D]
It is followed by this note:
Note that a ·normal character· can be represented either as itself, or with a character reference (http://www.w3.org/TR/2000/WD-xml-2e-20000814#dt-charref).
I'm not very fluent in English and I am not sure how to understand that. If the authors put special emphasis on the possibility of representing ·normal characters· with character references, then I would expect that such a representation is not allowed for metacharacters. Am I right on this point?
And if I am, what are the implications if a character reference specifies the code point of a metacharacter, say an asterisk, as in a&#x2A; ? Does it mean a* or a\* (with the asterisk escaped)?

All examples I have found with Google use character references to put metacharacters in the chargroups of character class expressions. However, the Char symbol appears in production 9 of the regexp syntax, as one of three alternatives for an Atom, and neither Atom nor Char itself is used to define any kind of chargroup; an XmlChar is used instead, which in turn has no comment attached about the use of character references.
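Please clarify the mess in my head:
- Does a metacharacter specified with a character reference become a normal character? How should a&#x2A; work?
- Is a character reference valid between [ and ] (inside character class expressions)?

Does a metacharacter specified with a character reference become a normal character? How should a&#x2A; work?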
No, it becomes a*, and the * is still a metacharacter, which can be escaped as \*
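One way to see why is to check what the regex engine actually receives. The sketch below is only an illustration under stated assumptions: it uses Python's xml.etree.ElementTree as the XML parser, a bare pattern element as a hypothetical stand-in for xs:pattern, and Python's re module, which is not the XML Schema regex dialect but treats a* and a\* the same way for this purpose. The helper name pattern_after_xml_parsing is made up for the example.

```python
import re
import xml.etree.ElementTree as ET


def pattern_after_xml_parsing(xml_fragment: str) -> str:
    """Return the 'value' attribute as a regex engine would receive it,
    i.e. after the XML parser has expanded any character references."""
    return ET.fromstring(xml_fragment).attrib["value"]


# Hypothetical stand-ins for <xs:pattern value="..."/> (namespace omitted).
quantified = pattern_after_xml_parsing('<pattern value="a&#x2A;"/>')
escaped = pattern_after_xml_parsing(r'<pattern value="a\&#x2A;"/>')

# Assumption: Python's re is used only as a convenient engine here;
# it is NOT an implementation of the XML Schema regex grammar.
print(repr(quantified), bool(re.fullmatch(quantified, "aaa")))
print(repr(escaped), bool(re.fullmatch(escaped, "aaa")),
      bool(re.fullmatch(escaped, "a*")))
```

In this sketch the first pattern reaches the engine as a* and matches aaa, while the second reaches it as a\* and matches only the literal two-character string a*.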
Coming to the next question:
From http://msdn.microsoft.com/en-us/library/ms256185.aspx
charRange ::= seRange | XmlCharRef | XmlCharIncDash
where
XmlCharRef ::= ( '&#' [0-9]+ ';' ) | ('&#x' [0-9a-fA-F]+ ';' )
But in the W3C specification,
charRange ::= seRange | XmlCharIncDash
XmlCharRef is not included. So:
Is a character reference valid between [ and ] (inside character class expressions, http://www.w3.org/TR/xmlschema-2/#dt-charexpr)?
No.
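To make concrete which lexical form the W3C charRange rule leaves out, the XmlCharRef production quoted from MSDN above can be transcribed as a Python regular expression. This is only a transcription of that single rule for illustration, not a validator for the whole charRange grammar; the names XML_CHAR_REF and is_xml_char_ref are made up for the example.

```python
import re

# Transcription of the quoted production:
#   XmlCharRef ::= ( '&#' [0-9]+ ';' ) | ( '&#x' [0-9a-fA-F]+ ';' )
XML_CHAR_REF = re.compile(r"&#[0-9]+;|&#x[0-9a-fA-F]+;")


def is_xml_char_ref(text: str) -> bool:
    """True if the whole string has the lexical form of XmlCharRef."""
    return XML_CHAR_REF.fullmatch(text) is not None


for candidate in ["&#42;", "&#x2A;", "*", "&#;"]:
    print(candidate, is_xml_char_ref(candidate))
```

The decimal and hexadecimal forms are accepted, while a bare * and the empty reference &#; are not.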