Background
I am writing a DFA based regex parser, for performance reasons, I need to use a dictionary [Unicode.Scalar : State]
to map the next states. Now I need a bunch of special unicode values to represent special character expressions like .
, \w
, \d
...
My Question
Which of the unicode values are safe to use for this purpose?
I was using U+0000
for .
, but I need more now. I checked the unicode documentation, the Noncharacters seems promising, but in swift, those are considered invalid unicode. For example, the following code gives me a compiler error Invalid unicode scalar
.
let c = "\u{FDD0}"
If you insist on using Unicode.Scalar
, nothing. Unicode.Scalar
is designed to represent all valid characters in Unicode, including not-assigned-yet code points. So, it cannot represent noncharacters nor dangling surrogates ("\u{DC00}"
also causes error).
And in Swift, String
can contain all valid Unicode.Scalar
s including U+0000
. For example "\u{0000}"
(== "\0"
) is a valid String and its length (count
) is 1. With using U+0000
as a meta-character, your code would not work with valid Swift Strings.
Consider using [UInt32: State]
instead of [Unicode.Scalar: State]
. Unicode uses only 0x0000...0x10FFFF
(including noncharacters), so using values greater than 0x10FFFF
is safe (in your meaning).
Also getting value
property of Unicode.Scalar
takes very small cost and its cost can be ignored in optimized code. I'm not sure using Dictionary
is really a good way to handle your requirements, but [UInt32: State]
would be as efficient as [Unicode.Scalar: State]
.