swift string unicode arabic

How to decompose Arabic letters?


I need to decompose an Arabic word into its consonants and vowels. For instance, "ضَرَبَ" has three consonants and three vowels and therefore I would like its length to be 6 instead of 3. However:

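import Foundation  // provides decomposedStringWithCanonicalMapping

let t = "ضَرَبَ"
let ud = t.decomposedStringWithCanonicalMapping  // canonical decomposition (NFD)
print("ud Length = \(ud.count)")  // ud Length = 3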

I get 3 instead of 6. How can I decompose this string into the following array?

"\u{0636}\u{064e}\u{0631}\u{064e}\u{0628}\u{064e}"

Solution

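  • You want to count Unicode scalars (code points) rather than Swift Characters (i.e. extended grapheme clusters), after applying normalization. String.count counts Characters, and each letter together with its vowel mark forms a single Character, which is why you get 3. You can count scalars with .unicodeScalars: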

    print("ud Length = \(ud.unicodeScalars.count)")  // ud Length = 6
                            ^^^^^^^^^^^^^^
    

    Keep in mind that the scalars are not just "consonants and vowels." Marks such as shaddah and nunation are separate code points too, so they will also be counted (I assume that's a benefit for your use case; just something to keep in mind).

    Your question about "decompose this string into the following array" is somewhat misguided. The example you give is a String, not an Array, and, importantly, it is the same String as t. (Check it with ==.) If you want an Array of Unicode.Scalar values, however, that would be Array(ud.unicodeScalars).
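
Putting the points above together, here is a minimal sketch. It writes the word with \u{...} escapes so the scalars are visible in the source. The split into letters and marks by general category is an illustrative addition, not something the question asked for, and the names scalars, marks and letters are made up for this example.

```swift
import Foundation

// The word from the question, spelled out scalar by scalar.
let t = "\u{0636}\u{064e}\u{0631}\u{064e}\u{0628}\u{064e}"
let ud = t.decomposedStringWithCanonicalMapping  // canonical decomposition (NFD)

// String equality uses canonical equivalence, so normalizing does not change it.
print(t == ud)          // true

// Characters (grapheme clusters) vs. Unicode scalars.
let scalars = Array(ud.unicodeScalars)
print(ud.count)         // 3
print(scalars.count)    // 6
print(scalars.map { String($0.value, radix: 16) })
// ["636", "64e", "631", "64e", "628", "64e"]

// Assumption for illustration: treat nonspacing marks as "vowels"
// and other letters as "consonants". Shaddah and nunation marks
// would land in `marks` as well.
let marks = scalars.filter { $0.properties.generalCategory == .nonspacingMark }
let letters = scalars.filter { $0.properties.generalCategory == .otherLetter }
print(letters.count, marks.count)  // 3 3
```

If you only need the count, ud.unicodeScalars.count avoids building the intermediate Array.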