swift string unicode unicode-normalization

Normalizing (composing and decomposing) UTF-8 strings in Swift


Characters with accents in Unicode strings can be represented in a "short" (composed) or a "long" (decomposed) form. As a result, string a below encodes to 8 UTF-8 bytes and string b to 10, even though they look identical in Xcode:

let a:String = "δέκα" // 8 bytes
print(a.data(using:String.Encoding.utf8)!.count)

let b:String = "δέκα" // 10 bytes
print(b.data(using:String.Encoding.utf8)!.count)
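Because the two literals above are visually indistinguishable, the difference is easier to see with explicit escapes. The following is a minimal sketch; the names composed and decomposed are illustrative only. It shows that the two forms differ in scalar and byte counts, while Swift's Character count and == operator (which uses canonical equivalence) treat them as the same:

```swift
// Composed form: U+03AD GREEK SMALL LETTER EPSILON WITH TONOS
let composed = "δ\u{03AD}κα"
// Decomposed form: U+03B5 EPSILON followed by U+0301 COMBINING ACUTE ACCENT
let decomposed = "δ\u{03B5}\u{0301}κα"

print(composed.unicodeScalars.count, decomposed.unicodeScalars.count) // 4 5
print(composed.utf8.count, decomposed.utf8.count)                     // 8 10
print(composed.count, decomposed.count)                               // 4 4
print(composed == decomposed)                                         // true
```

So the difference only matters when the raw scalars or bytes are compared or stored.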


I need to "shrink" strings to ensure they are always in the shorter "composed" format. How is this done in Swift?


Footnote: I know that it is possible to completely strip accents like this (below). I don't want to do that, I just want to "compose" the characters.

let usPosixLocale = Locale(identifier: "en_US_POSIX")
let out = "δέκα".folding(options: [.caseInsensitive, .diacriticInsensitive], locale: usPosixLocale)

I am aware of the .widthInsensitive option, but the documentation appears to indicate that it applies only to Asian characters. Specifically, this does not compose or decompose characters:

let out = a.folding(options: [.widthInsensitive], locale: usPosixLocale)
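For contrast, here is a small sketch of what .widthInsensitive is understood to do: it folds full-width forms to their ordinary counterparts, which is unrelated to composing accents. The names fullWidth and folded are illustrative only, and the result is assumed to be the same on every Foundation implementation:

```swift
import Foundation

// Full-width Latin capital letters A, B, C (U+FF21 to U+FF23).
let fullWidth = "\u{FF21}\u{FF22}\u{FF23}"

// Width folding maps them to plain ASCII letters.
let folded = fullWidth.folding(options: [.widthInsensitive], locale: nil)
print(folded)
```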

UPDATE

Here is a second, longer version of the code that spells out the bytes to make the difference explicit.

let a:String = String(bytes:[206, 180, 206, 173, 206, 186, 206, 177], encoding:.utf8)!
print(a, a.data(using:String.Encoding.utf8)!.count)

let b:String = String(bytes:[206, 180, 206, 181, 204, 129, 206, 186, 206, 177], encoding:.utf8)!
print(b, b.data(using:String.Encoding.utf8)!.count)

let usPosixLocale = Locale(identifier: "en_US_POSIX")
let out = b.folding(options: [.widthInsensitive], locale: usPosixLocale)
print(out.data(using:String.Encoding.utf8)!.count)



Solution

  • precomposedStringWithCanonicalMapping does the normalization:

    import Foundation // provides Data and the normalization properties

    let a = "δέκα"
    print(a, Data(a.utf8).count) // δέκα 8
    
    let b = "δε\u{0301}κα"
    print(b, Data(b.utf8).count) // δέκα 10
    
    let bn = b.precomposedStringWithCanonicalMapping
    print(bn, Data(bn.utf8).count) // δέκα 8
    

    A “literal” comparison demonstrates that a is identical to bn, but not to b:

    print(b.compare(a, options: .literal) == .orderedSame)  // false
    print(bn.compare(a, options: .literal) == .orderedSame) // true
    

    Remarks: precomposedStringWithCanonicalMapping produces the “Unicode Normalization Form C.” There is also precomposedStringWithCompatibilityMapping, which produces the “Unicode Normalization Form KC.” See the Unicode Standard for the precise differences. Roughly, the latter folds more differences which are “inappropriately distinguished in many circumstances.” Examples:

    let c = "\u{fb01}" // LATIN SMALL LIGATURE FI
    print(c, c.precomposedStringWithCanonicalMapping, c.precomposedStringWithCompatibilityMapping)
    // ﬁ ﬁ fi
    
    let d = "2\u{2075}" // DIGIT TWO followed by SUPERSCRIPT FIVE
    print(d, d.precomposedStringWithCanonicalMapping, d.precomposedStringWithCompatibilityMapping)
    // 2⁵ 2⁵ 25
    
    let e = "\u{2165}" // ROMAN NUMERAL SIX
    print(e, e.precomposedStringWithCanonicalMapping, e.precomposedStringWithCompatibilityMapping)
    // Ⅵ Ⅵ VI
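The title also asks about decomposing. As a rough sketch of the reverse direction, Foundation offers decomposedStringWithCanonicalMapping (Normalization Form D) alongside the precomposed properties used above. The variable names here are illustrative only, and the comparison is done on Unicode scalars because == on String ignores the difference between the two forms:

```swift
import Foundation

// Start from the composed form (U+03AD = epsilon with tonos).
let composed = "δ\u{03AD}κα"

// NFD: split precomposed characters into base letter + combining mark.
let nfd = composed.decomposedStringWithCanonicalMapping
print(nfd.utf8.count) // 10

// NFC: recombine base letter + combining mark where a precomposed form exists.
let nfc = nfd.precomposedStringWithCanonicalMapping
print(nfc.utf8.count) // 8

// Compare scalar by scalar, since String's == uses canonical equivalence.
print(nfc.unicodeScalars.elementsEqual(composed.unicodeScalars)) // true
print(nfd.unicodeScalars.elementsEqual(composed.unicodeScalars)) // false
```

Applying either property to a string that is already in that form leaves it unchanged, so normalizing once on input is enough.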