swiftunicodenoncharacter

Which unicode code can be used safely as reserved value?


Background

I am writing a DFA based regex parser, for performance reasons, I need to use a dictionary [Unicode.Scalar : State] to map the next states. Now I need a bunch of special unicode values to represent special character expressions like ., \w, \d...

My Question

Which of the unicode values are safe to use for this purpose?

I was using U+0000 for ., but I need more now. I checked the unicode documentation, the Noncharacters seems promising, but in swift, those are considered invalid unicode. For example, the following code gives me a compiler error Invalid unicode scalar.

let c = "\u{FDD0}"

Solution

  • If you insist on using Unicode.Scalar, nothing. Unicode.Scalar is designed to represent all valid characters in Unicode, including not-assigned-yet code points. So, it cannot represent noncharacters nor dangling surrogates ("\u{DC00}" also causes error).

    And in Swift, String can contain all valid Unicode.Scalars including U+0000. For example "\u{0000}" (== "\0") is a valid String and its length (count) is 1. With using U+0000 as a meta-character, your code would not work with valid Swift Strings.


    Consider using [UInt32: State] instead of [Unicode.Scalar: State]. Unicode uses only 0x0000...0x10FFFF (including noncharacters), so using values greater than 0x10FFFF is safe (in your meaning).

    Also getting value property of Unicode.Scalar takes very small cost and its cost can be ignored in optimized code. I'm not sure using Dictionary is really a good way to handle your requirements, but [UInt32: State] would be as efficient as [Unicode.Scalar: State].