Here is my code:
public const RegexOptions MyRegexOptions =
    RegexOptions.IgnoreCase |
    RegexOptions.IgnorePatternWhitespace |
    RegexOptions.ExplicitCapture |
    RegexOptions.CultureInvariant;

[GeneratedRegex("[a-z]", MyRegexOptions)]
public static partial Regex WhereIsKelvin();
The generated code is:
/// <remarks>
/// Pattern:<br/>
/// <code>[a-z]</code><br/>
/// Options:<br/>
/// <code>RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace | RegexOptions.CultureInvariant</code><br/>
/// Explanation:<br/>
/// <code>
/// ○ Match a character in the set [A-Za-z\u212A].<br/>
/// </code>
/// </remarks>
[global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "8.0.12.16413")]
public static partial global::System.Text.RegularExpressions.Regex WhereIsKelvin() => global::System.Text.RegularExpressions.Generated.WhereIsKelvin_0.Instance;
Now this is bizarre: why is there a \u212A character (Kelvin sign) in my alpha character set? And it gets weirder: for all sets up to [a-k] there is nothing extra, but from [a-l] through [a-z] there is this extra character \u212A. Apparently it has something to do with the IgnoreCase flag, because the lowercase form of the Kelvin sign is actually the Latin k.
This is wrong: my strings cannot contain the Unicode Kelvin sign.
It’s because of RegexOptions.IgnoreCase plus RegexOptions.CultureInvariant.
RegexOptions.IgnoreCase automatically expands lowercase letters to include their uppercase counterparts, so it is clear that [a-z] would turn into [a-zA-Z]. However, you are also using RegexOptions.CultureInvariant, which forces culture-independent Unicode behavior. In Unicode, \u212A is, as you noticed, the Kelvin sign. In culture-invariant case-insensitive matching, the Kelvin sign (U+212A) is treated as a case variant of k (U+006B), because it lowercases to k, so it is added to the set alongside K (U+004B).
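The relationship described above can be checked directly. The following is a minimal sketch rather than anything taken from the generator itself; it assumes .NET 7 or later (the generated code in the question comes from the 8.0 generator) and uses only char.ToLowerInvariant, char.ToUpperInvariant and Regex.IsMatch. The constant name KelvinSign is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative name (not from the question): U+212A KELVIN SIGN,
// which looks like a Latin K but is a different code point.
const char KelvinSign = '\u212A';

// The mapping is one-way: the Kelvin sign lowercases to Latin k (U+006B)...
Console.WriteLine(char.ToLowerInvariant(KelvinSign) == 'k');

// ...while k uppercases to Latin K (U+004B), not back to the Kelvin sign.
Console.WriteLine(char.ToUpperInvariant('k') == 'K');

// With IgnoreCase, [a-z] is widened to every case variant of its members,
// which is why the generated set is [A-Za-z\u212A].
var options = RegexOptions.IgnoreCase | RegexOptions.CultureInvariant;
Console.WriteLine(Regex.IsMatch(KelvinSign.ToString(), "[a-z]", options));

// Without IgnoreCase the set is taken literally.
Console.WriteLine(Regex.IsMatch(KelvinSign.ToString(), "[a-z]", RegexOptions.CultureInvariant));
```

The last two lines differ only in the IgnoreCase flag, which isolates the flag as the source of the extra character.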
You can read more about comparison using the invariant culture here, and you can find which characters fold into which in the Unicode CaseFolding-16.0.0.txt document. It contains the line
212A; C; 006B; # KELVIN SIGN
which means that the Kelvin sign folds into \u006B/\x6B (k).
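If the Kelvin sign must not match at all, one option is to drop RegexOptions.IgnoreCase and spell out both cases in the set. This is only a sketch of a possible workaround, not a documented recommendation. The class name Patterns and the method name AsciiAlpha are made up for illustration, and it assumes .NET 7 or later so that the GeneratedRegex source generator is available.

```csharp
using System;
using System.Text.RegularExpressions;

// Latin letters of either case are listed explicitly, so both match.
Console.WriteLine(Patterns.AsciiAlpha().IsMatch("k"));
Console.WriteLine(Patterns.AsciiAlpha().IsMatch("K"));

// The Kelvin sign (U+212A) is not in the explicit set and no case
// expansion is requested, so it should not match.
Console.WriteLine(Patterns.AsciiAlpha().IsMatch("\u212A"));

// Hypothetical container class; GeneratedRegex requires a partial method
// inside a partial type.
public static partial class Patterns
{
    // No IgnoreCase here: the set [A-Za-z] is taken literally.
    [GeneratedRegex("[A-Za-z]", RegexOptions.CultureInvariant)]
    public static partial Regex AsciiAlpha();
}
```

The other flags from the question (IgnorePatternWhitespace, ExplicitCapture) do not affect case handling and could be kept alongside CultureInvariant.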