unicode text-segmentation

Non-reducible grapheme clusters in Unicode


I'm of the opinion that a "user-perceived character" (henceforth UPC) iterator would be very useful in a Unicode library. By UPC I mean the sense discussed in Unicode Standard Annex #29: what a user perceives as a character, which might be represented in Unicode as a single code point or as a grapheme cluster. Since I typically work with languages written in the Latin script, I always come up with examples like "I want to handle ü as one UPC, regardless of whether the UPC is a grapheme cluster or a single code point".

Colleagues who are against a UPC iterator (or grapheme cluster iterator, take your pick) counter with "You can normalize to NFC and then use code point iteration" and "there is no use case for grapheme cluster iteration".

I keep thinking of Latin-centric use cases, which may not translate well to the rest of the Unicode universe. For example, when doing terminal output I want to pad a column to a width of N columns, so I want to know how many UPCs are in a string.

I think what I want to know is:

  1. Are there meaningful grapheme clusters that can't be normalized to a single code point? Are there any that are likely to occur among Western users? I'm assuming Korean or Arabic are cases of this, but I have to admit to total ignorance there.
  2. Do any other languages provide UPC/grapheme cluster iteration or operations? Is there any kind of advice from the Unicode specification?

Solution

  • It's unclear how your questions are not answered by UAX #29:

    1. There are many such grapheme clusters, even for languages that use only the Latin alphabet, because not every combining mark has a precomposed form with every base letter; see, for example, the gaps in this table on Wikipedia. Table 1a in UAX #29 has several non-Latin examples.

    2. This is the purpose of UAX #29: to generalise grapheme cluster operations to all languages that are supported in Unicode.
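
To make point 1 concrete, here is a minimal Python sketch using only the standard `unicodedata` module. It compares "u" plus a combining diaeresis, which NFC composes into a single code point, with "q" plus a combining tilde, for which Unicode defines no precomposed form. The helper names `nfc_length` and `approx_cluster_count` are hypothetical and exist only for this illustration. The second helper is a deliberately rough approximation that attaches nonspacing and enclosing marks to the preceding code point; it does not implement the UAX #29 rules (it ignores Hangul jamo, spacing marks, ZWJ sequences, and so on).

```python
import unicodedata


def nfc_length(text: str) -> int:
    """Return the number of code points left after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))


def approx_cluster_count(text: str) -> int:
    """Roughly count user-perceived characters.

    Assumption for illustration only: every code point that is not a
    nonspacing (Mn) or enclosing (Me) mark starts a new cluster.
    This is NOT a full UAX #29 implementation.
    """
    normalized = unicodedata.normalize("NFC", text)
    return sum(
        1 for ch in normalized
        if unicodedata.category(ch) not in ("Mn", "Me")
    )


u_diaeresis = "u\u0308"  # u + COMBINING DIAERESIS, composes to U+00FC
q_tilde = "q\u0303"      # q + COMBINING TILDE, no precomposed form exists

print(nfc_length(u_diaeresis))        # 1
print(nfc_length(q_tilde))            # 2
print(approx_cluster_count(q_tilde))  # 1
```

So "normalize to NFC, then iterate code points" still splits the second example in two, while a cluster-aware count treats it as one unit. For real use, a library that follows the UAX #29 segmentation rules is the safer choice than this approximation.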