
Strange Behavior In CharacterSet.contains() Method, With High UTF8 Characters Mixed With ASCII


Here's the deal: I am creating a StringProtocol extension that adds the ability to split a string on a character set (any character in the set acts as a separator; greedy comparison).

The issue is that I am having difficulties comparing against a CharacterSet that has BOTH low-number ASCII characters AND high-number UTF8 characters.

If the set contains only high UTF8 characters, or only ASCII characters, the match works fine.

I created a playground that illustrates this.

The strange result is the second-to-last printout ("Test String 2 does not have a space or a joker."). That should say "does."

The issue is that the space in the CharacterSet matches, but the joker card does not.

Any ideas? Here's the playground:

import Foundation

public extension StringProtocol {
    func containsOneOfThese(_ inCharacterset: CharacterSet) -> Bool {
        self.contains { (char) in
            char.unicodeScalars.contains { (scalar) in inCharacterset.contains(scalar) }
        }
    }
}

let space = " "
let joker = "🃟"
let both = space + joker

let spadesNumberCards = "🂡🂢🂣🂤🂥🂦🂧🂨🂩🂪"
let spadesFaceCards = "🃛🂮🂫🂭"

let testString1 = spadesNumberCards + space + spadesFaceCards
let testString2 = spadesNumberCards + joker + spadesFaceCards
let testString3 = spadesNumberCards + both + spadesFaceCards

print("These Are The Strings We Are Testing:\n")
print("Test String 1: \"\(testString1)\"")
print("Test String 2: \"\(testString2)\"")
print("Test String 3: \"\(testString3)\"")
      
print("\nFirst, See If Any Of the Strings Contain Spaces:\n")
print("Test String 1 does \(testString1.containsOneOfThese(CharacterSet(charactersIn: space)) ? "" : "not ")have a space.")
print("Test String 2 does \(testString2.containsOneOfThese(CharacterSet(charactersIn: space)) ? "" : "not ")have a space.")
print("Test String 3 does \(testString3.containsOneOfThese(CharacterSet(charactersIn: space)) ? "" : "not ")have a space.")

print("\nNext, See If Any Of the Strings Contain Jokers:\n")
print("Test String 1 does \(testString1.containsOneOfThese(CharacterSet(charactersIn: joker)) ? "" : "not ")have a joker.")
print("Test String 2 does \(testString2.containsOneOfThese(CharacterSet(charactersIn: joker)) ? "" : "not ")have a joker.")
print("Test String 3 does \(testString3.containsOneOfThese(CharacterSet(charactersIn: joker)) ? "" : "not ")have a joker.")

print("\nOK, Now it gets weird:\n")

print("Test String 1 does \(testString1.containsOneOfThese(CharacterSet(charactersIn: both)) ? "" : "not ")have a space or a joker.")
print("Test String 2 does \(testString2.containsOneOfThese(CharacterSet(charactersIn: both)) ? "" : "not ")have a space or a joker.")
print("Test String 3 does \(testString3.containsOneOfThese(CharacterSet(charactersIn: both)) ? "" : "not ")have a space or a joker.")

Which prints out:

These Are The Strings We Are Testing:

Test String 1: "🂡🂢🂣🂤🂥🂦🂧🂨🂩🂪 🃛🂮🂫🂭"
Test String 2: "🂡🂢🂣🂤🂥🂦🂧🂨🂩🂪🃟🃛🂮🂫🂭"
Test String 3: "🂡🂢🂣🂤🂥🂦🂧🂨🂩🂪 🃟🃛🂮🂫🂭"

First, See If Any Of the Strings Contain Spaces:

Test String 1 does have a space.
Test String 2 does not have a space.
Test String 3 does have a space.

Next, See If Any Of the Strings Contain Jokers:

Test String 1 does not have a joker.
Test String 2 does have a joker.
Test String 3 does have a joker.

OK, Now it gets weird:

Test String 1 does have a space or a joker.
Test String 2 does not have a space or a joker.
Test String 3 does have a space or a joker.

Solution

  • It seems that CharacterSet.init(charactersIn string: String) does not work correctly if the string contains characters from both inside and outside the BMP (basic multilingual plane):

    let s = " 🃟"
    let cs = CharacterSet(charactersIn: s)
    s.unicodeScalars.forEach {
        print(cs.contains($0))
    }
    
    // Expected output: true, true
    // Actual output:   true, false
    

    A workaround is to create the character set from the sequence of Unicode scalars instead:

    let cs = CharacterSet(s.unicodeScalars)
    

    This will produce the expected output.

    But note that this cannot handle the full range of Swift Characters (which include grapheme clusters consisting of multiple Unicode scalars). Therefore you might want to work with a Set<Character> instead.
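
The Set<Character> suggestion could look roughly like the sketch below. The names containsAny(of:) and split(onAnyOf:) are hypothetical helpers made up for this illustration; they are not part of the standard library or of the original playground. It assumes that matching on whole Characters (grapheme clusters) is what you want, and it relies only on the standard library's contains(where:) and split(whereSeparator:).

```swift
// Hypothetical helpers for illustration only (names are assumptions,
// not standard library API and not from the original post).
extension StringProtocol {
    /// Returns true if any Character of the string is a member of `characters`.
    func containsAny(of characters: Set<Character>) -> Bool {
        contains(where: { characters.contains($0) })
    }

    /// Splits on every Character found in `separators`.
    /// Empty pieces are dropped, so adjacent separators act as one.
    func split(onAnyOf separators: Set<Character>) -> [SubSequence] {
        split(whereSeparator: { separators.contains($0) })
    }
}

// Same kind of data as the playground: a space and a joker as separators.
let separators: Set<Character> = [" ", "🃟"]
let cards = "🂡🂢 🃟🃛🂮"

print(cards.containsAny(of: separators))  // the joker and the space both match
print(cards.split(onAnyOf: separators))   // two pieces: the cards before and after the separators
```

Because the comparison is done per Character rather than per Unicode scalar, this should also work for separators that consist of several scalars, which a CharacterSet cannot express.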