regex .net-core

GeneratedRegex("[a-z]") adds '\u212A' to the set [a-z]


Here is my code:

    public const RegexOptions MyRegexOptions = RegexOptions.IgnoreCase |
                                                RegexOptions.IgnorePatternWhitespace |
                                                RegexOptions.ExplicitCapture |
                                                RegexOptions.CultureInvariant
                                                ;
    [GeneratedRegex("[a-z]", MyRegexOptions)]
    public static partial Regex WhereIsKelvin();

The generated code is:

        /// <remarks>
        /// Pattern:<br/>
        /// <code>[a-z]</code><br/>
        /// Options:<br/>
        /// <code>RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace | RegexOptions.CultureInvariant</code><br/>
        /// Explanation:<br/>
        /// <code>
        /// ○ Match a character in the set [A-Za-z\u212A].<br/>
        /// </code>
        /// </remarks>
        [global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "8.0.12.16413")]
        public static partial global::System.Text.RegularExpressions.Regex WhereIsKelvin() => global::System.Text.RegularExpressions.Generated.WhereIsKelvin_0.Instance;

Now this is bizarre: why is there a \u212A character (the Kelvin sign) in my alpha character set? And it gets weirder: for the sets up to [a-k] there is nothing extra, but from [a-l] through [a-z] the extra \u212A appears. Apparently it has something to do with the IgnoreCase flag, because the lowercase form of the Kelvin sign is the Latin k.

This is wrong: my strings cannot contain the Unicode Kelvin sign.


Solution

  • It’s because of RegexOptions.IgnoreCase plus RegexOptions.CultureInvariant.

    RegexOptions.IgnoreCase expands each character in the set to include its case equivalents, so it is clear that [a-z] turns into [A-Za-z]. However, you are also using RegexOptions.CultureInvariant, which makes the case mapping culture-independent and Unicode-based. In Unicode, \u212A is, as you noticed, the Kelvin sign. Under culture-invariant case-insensitive matching, the Kelvin sign (U+212A) is treated as a case equivalent of k (U+006B), so it is added to the set alongside K (U+004B).

    You can read more about comparison using the invariant culture here, and the Unicode CaseFolding-16.0.0.txt file lists which characters fold to which. It contains the line

    212A; C; 006B; # KELVIN SIGN
    

    which means that the Kelvin sign folds to \u006B/\x6B (k).
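
If you want to see the effect at runtime, or need a set that really cannot match the Kelvin sign, here is a minimal sketch. It assumes a plain console project; the class and method names (KelvinDemo and so on) are made up for illustration and do not come from your code. The idea is to compare the case-insensitive set with one that spells out both ASCII ranges and leaves out RegexOptions.IgnoreCase, so no case equivalents are added.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; none of these names come from the question.
public static class KelvinDemo
{
    private const string Kelvin = "\u212A"; // KELVIN SIGN

    // Same pattern as in the question, with the two options that matter here.
    public static bool IgnoreCaseMatchesKelvin() =>
        Regex.IsMatch(Kelvin, "[a-z]",
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Both ASCII ranges written out and no IgnoreCase:
    // the set is used exactly as written.
    public static bool ExplicitSetMatchesKelvin() =>
        Regex.IsMatch(Kelvin, "[A-Za-z]", RegexOptions.None);

    // Invariant lowercasing of the Kelvin sign, for comparison with the
    // CaseFolding entry quoted above.
    public static char LowerOfKelvin() => char.ToLowerInvariant('\u212A');

    public static void Main()
    {
        Console.WriteLine(IgnoreCaseMatchesKelvin());   // True
        Console.WriteLine(ExplicitSetMatchesKelvin());  // False
        Console.WriteLine(LowerOfKelvin() == 'k');      // True
    }
}
```

The trade-off is that without IgnoreCase every letter elsewhere in the pattern has to be written in both cases too, so this fits best when the pattern is small or only this one set matters.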