Accented characters in Unicode strings can be represented in a "short" (composed) or a "long" (decomposed) form. This means that in Xcode string a has a UTF-8 length of 8 bytes and string b has a UTF-8 length of 10 bytes, even though they appear the same:
let a:String = "δέκα" // 8 bytes
print(a.data(using:String.Encoding.utf8)!.count)
let b:String = "δέκα" // 10 bytes
print(b.data(using:String.Encoding.utf8)!.count)
I need to "shrink" strings to ensure they are always in the shorter "composed" format. How is this done in Swift?
Footnote: I know that it is possible to strip accents completely, as shown below. I don't want to do that; I just want to "compose" the characters.
let usPosixLocale = Locale(identifier: "en_US_POSIX")
let out = "δέκα".folding(options: [.caseInsensitive, .diacriticInsensitive], locale: usPosixLocale)
I am aware of the .widthInsensitive option, but the documentation appears to indicate that it is only for Asian characters. Specifically, this does not compose or decompose characters:
let out = a.folding(options: [.widthInsensitive], locale: usPosixLocale)
UPDATE
Here is a second, longer version of the code that spells out the bytes to make the difference clear.
let a:String = String(bytes:[206, 180, 206, 173, 206, 186, 206, 177], encoding:.utf8)!
print(a, a.data(using:String.Encoding.utf8)!.count)
let b:String = String(bytes:[206, 180, 206, 181, 204, 129, 206, 186, 206, 177], encoding:.utf8)!
print(b, b.data(using:String.Encoding.utf8)!.count)
let usPosixLocale = Locale(identifier: "en_US_POSIX")
let out = b.folding(options: [.widthInsensitive], locale: usPosixLocale)
print(out.data(using:String.Encoding.utf8)!.count)
precomposedStringWithCanonicalMapping does the normalization:
let a = "δέκα"
print(a, Data(a.utf8).count) // δέκα 8
let b = "δε\u{0301}κα"
print(b, Data(b.utf8).count) // δέκα 10
let bn = b.precomposedStringWithCanonicalMapping
print(bn, Data(bn.utf8).count) // δέκα 8
A “literal” comparison demonstrates that a is identical to bn, but not to b:
print(b.compare(a, options: .literal) == .orderedSame) // false
print(bn.compare(a, options: .literal) == .orderedSame) // true
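As a side note on why the two strings "appear the same": as far as I know, Swift's own == compares strings by canonical equivalence, so the difference only shows up at the scalar or byte level. The following is a minimal sketch of that; the constant names are made up for illustration, and the composed literal is written with an explicit \u{03AD} escape so the source file's own normalization cannot interfere.

```swift
import Foundation

// Hypothetical names, chosen for this illustration only.
let composed = "δ\u{03AD}κα"      // έ as a single scalar, U+03AD
let decomposed = "δε\u{0301}κα"   // ε followed by U+0301 COMBINING ACUTE ACCENT

// Swift's == treats canonically equivalent strings as equal.
print(composed == decomposed)                            // true
// Both have the same number of user-perceived characters...
print(composed.count, decomposed.count)                  // 4 4
// ...but a different number of Unicode scalars and UTF-8 bytes.
print(composed.unicodeScalars.count, decomposed.unicodeScalars.count) // 4 5
print(Array(composed.utf8) == Array(decomposed.utf8))    // false

// After NFC normalization the byte sequences match as well.
let normalized = decomposed.precomposedStringWithCanonicalMapping
print(Array(normalized.utf8) == Array(composed.utf8))    // true
```

So normalization matters mainly when the bytes leave Swift, for example when they are hashed, stored, or sent to a system that compares them literally.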
Remarks: precomposedStringWithCanonicalMapping produces the “Unicode Normalization Form C.” There is also precomposedStringWithCompatibilityMapping, which produces the “Unicode Normalization Form KC.” See the description of the normalization forms in the Unicode Standard for the precise differences. Roughly, the latter folds more differences which are “inappropriately distinguished in many circumstances.” Examples:
let c = "\u{fb01}" // LATIN SMALL LIGATURE FI
print(c, c.precomposedStringWithCanonicalMapping, c.precomposedStringWithCompatibilityMapping)
// ﬁ ﬁ fi
let d = "2\u{2075}"
print(d, d.precomposedStringWithCanonicalMapping, d.precomposedStringWithCompatibilityMapping)
// 2⁵ 2⁵ 25
let e = "\u{2165}" // ROMAN NUMERAL SIX
print(e, e.precomposedStringWithCanonicalMapping, e.precomposedStringWithCompatibilityMapping)
// Ⅵ Ⅵ VI
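For completeness, here is a small sketch of the opposite direction and of why the compatibility mapping should be used with care. It assumes Foundation's decomposedStringWithCanonicalMapping (Normalization Form D) as the counterpart of the property above; the constant names are again hypothetical.

```swift
import Foundation

// Canonical normalization round-trips: NFC -> NFD -> NFC gives the same bytes.
let nfc = "δ\u{03AD}κα"
let nfd = nfc.decomposedStringWithCanonicalMapping
print(nfd.unicodeScalars.count, nfd.utf8.count)    // 5 10
let back = nfd.precomposedStringWithCanonicalMapping
print(Array(back.utf8) == Array(nfc.utf8))         // true

// The compatibility mapping loses information: the ligature becomes
// the two letters "f" and "i" and cannot be recovered afterwards.
let ligature = "\u{fb01}"                          // LATIN SMALL LIGATURE FI
let folded = ligature.precomposedStringWithCompatibilityMapping
print(folded.unicodeScalars.count)                 // 2
print(folded == ligature)                          // false
```

If the goal is only to get the shorter composed form, the canonical variant is probably the safer choice, since it does not change what the text means.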