Here's the deal: I am creating a StringProtocol extension that splits a string on a character set (any character in the set acts as a separator; greedy comparison).
The issue is that I am having difficulty matching against a CharacterSet that contains BOTH low-numbered ASCII characters AND high-numbered Unicode characters.
If the set contains only high-numbered characters or only ASCII, the match works fine.
I created a playground that illustrates this.
The strange result is the second-to-last printout ("Test String 2 does not have a space or a joker."). That should say "does."
The issue is that the space in the CharacterSet matches, but the joker card does not.
Any ideas? Here's the playground:
import Foundation
public extension StringProtocol {
    func containsOneOfThese(_ inCharacterset: CharacterSet) -> Bool {
        self.contains { (char) in
            char.unicodeScalars.contains { (scalar) in inCharacterset.contains(scalar) }
        }
    }
}
let space = " "
let joker = "🃟"
let both = space + joker
let spadesNumberCards = "🂡🂢🂣🂤🂥🂦🂧🂨🂩🂪"
let spadesFaceCards = "🃛🂮🂫🂭"
let testString1 = spadesNumberCards + space + spadesFaceCards
let testString2 = spadesNumberCards + joker + spadesFaceCards
let testString3 = spadesNumberCards + both + spadesFaceCards
print("These Are The Strings We Are Testing:\n")
print("Test String 1: \"\(testString1)\"")
print("Test String 2: \"\(testString2)\"")
print("Test String 3: \"\(testString3)\"")
print("\nFirst, See If Any Of the Strings Contain Spaces:\n")
print("Test String 1 does \(testString1.containsOneOfThese(CharacterSet(charactersIn: space)) ? "" : "not ")have a space.")
print("Test String 2 does \(testString2.containsOneOfThese(CharacterSet(charactersIn: space)) ? "" : "not ")have a space.")
print("Test String 3 does \(testString3.containsOneOfThese(CharacterSet(charactersIn: space)) ? "" : "not ")have a space.")
print("\nNext, See If Any Of the Strings Contain Jokers:\n")
print("Test String 1 does \(testString1.containsOneOfThese(CharacterSet(charactersIn: joker)) ? "" : "not ")have a joker.")
print("Test String 2 does \(testString2.containsOneOfThese(CharacterSet(charactersIn: joker)) ? "" : "not ")have a joker.")
print("Test String 3 does \(testString3.containsOneOfThese(CharacterSet(charactersIn: joker)) ? "" : "not ")have a joker.")
print("\nOK, Now it gets weird:\n")
print("Test String 1 does \(testString1.containsOneOfThese(CharacterSet(charactersIn: both)) ? "" : "not ")have a space or a joker.")
print("Test String 2 does \(testString2.containsOneOfThese(CharacterSet(charactersIn: both)) ? "" : "not ")have a space or a joker.")
print("Test String 3 does \(testString3.containsOneOfThese(CharacterSet(charactersIn: both)) ? "" : "not ")have a space or a joker.")
Which prints out:
These Are The Strings We Are Testing:
Test String 1: "🂡🂢🂣🂤🂥🂦🂧🂨🂩🂪 🃛🂮🂫🂭"
Test String 2: "🂡🂢🂣🂤🂥🂦🂧🂨🂩🂪🃟🃛🂮🂫🂭"
Test String 3: "🂡🂢🂣🂤🂥🂦🂧🂨🂩🂪 🃟🃛🂮🂫🂭"
First, See If Any Of the Strings Contain Spaces:
Test String 1 does have a space.
Test String 2 does not have a space.
Test String 3 does have a space.
Next, See If Any Of the Strings Contain Jokers:
Test String 1 does not have a joker.
Test String 2 does have a joker.
Test String 3 does have a joker.
OK, Now it gets weird:
Test String 1 does have a space or a joker.
Test String 2 does not have a space or a joker.
Test String 3 does have a space or a joker.
It seems that CharacterSet.init(charactersIn string: String) does not work correctly if the string contains characters from both inside and outside the BMP (basic multilingual plane):
let s = " 🃟"
let cs = CharacterSet(charactersIn: s)
s.unicodeScalars.forEach {
    print(cs.contains($0))
}
// Expected output: true, true
// Actual output: true, false
A workaround is to create the character set from the sequence of Unicode scalars instead:
let cs = CharacterSet(s.unicodeScalars)
This will produce the expected output.
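As a rough sketch of how that workaround could be wrapped for reuse: the helper name `scalarSet(_:)` and the sample strings below are made up for illustration and are not part of Foundation or the original playground. It only relies on the `SetAlgebra` sequence initializer shown above.

```swift
import Foundation

// Hypothetical helper: builds the set from the string's Unicode scalars
// instead of going through CharacterSet(charactersIn:).
func scalarSet(_ string: String) -> CharacterSet {
    return CharacterSet(string.unicodeScalars)
}

let scalarSeparators = scalarSet(" 🃟")   // a space plus the joker
let jokerOnlySample = "🂡🂢🃟🂫"           // contains the joker but no space

// Same scalar-by-scalar test the question's extension performs.
let foundSeparator = jokerOnlySample.unicodeScalars.contains { scalar in
    scalarSeparators.contains(scalar)
}
print(foundSeparator)
```

Passing a set built this way to containsOneOfThese from the question should then report the joker as well as the space.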
But note that CharacterSet cannot handle the full range of Swift Characters (which include grapheme clusters consisting of multiple Unicode scalars). Therefore you might want to work with a Set<Character> instead.
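A minimal sketch of the Set<Character> approach, assuming the goal is the split described in the question. The method names `containsAny(of:)` and `split(onAnyOf:)` are hypothetical, chosen only for this example; the work is done by the standard library's `contains(where:)` and `split(whereSeparator:)`.

```swift
// Hypothetical method names; a sketch, not a definitive API.
public extension StringProtocol {
    /// True if at least one Character of the receiver is in `characters`.
    func containsAny(of characters: Set<Character>) -> Bool {
        return contains(where: { characters.contains($0) })
    }

    /// Splits the receiver at every Character found in `separators`.
    /// Empty pieces are dropped (the default for split(whereSeparator:)).
    func split(onAnyOf separators: Set<Character>) -> [SubSequence] {
        return split(whereSeparator: { separators.contains($0) })
    }
}

// A space, the joker, and a flag (one Character made of two Unicode scalars).
let characterSeparators: Set<Character> = [" ", "🃟", "🇯🇵"]
let cards = "🂡🂢 🂣🃟🂤🇯🇵🂥"

let hasSeparator = cards.containsAny(of: characterSeparators)
let pieces = cards.split(onAnyOf: characterSeparators).map { String($0) }
print(hasSeparator)   // true
print(pieces)         // ["🂡🂢", "🂣", "🂤", "🂥"]
```

Because the comparison is done on whole Characters, the flag is treated as a single separator rather than as two independent regional-indicator scalars.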