unicode terminology grapheme combining-marks

What is the difference between ‘combining characters’ and ‘grapheme extenders’ in Unicode?


They seem to do the same thing, as far as I can tell – although the set of grapheme extenders is larger than the set of combining characters. I’m clearly missing something here. Why the distinction?


The Unicode Standard, Chapter 3, D52

The Unicode Standard, Chapter 3, D59


Solution

  • The difference in actual usage is that combining characters are defined by their General Category (the Mark categories Mn, Mc and Me), which is a rough classification of characters, whereas grapheme extenders are mainly used for UAX #29 text segmentation.

    EDIT: Since you offered a bounty, I can elaborate a bit.

    Combining characters are characters that can't be used as stand-alone characters but must be combined with another character. They're used to define combining character sequences.
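    As a minimal sketch of that definition, the General Category can be inspected with Python's standard `unicodedata` module. The helper name `is_combining` and the sample code points are my own choices for illustration; the results also depend on the Unicode version bundled with your Python build.

```python
import unicodedata

def is_combining(ch: str) -> bool:
    """Hypothetical helper: True if the General Category is a Mark (Mn, Mc or Me)."""
    return unicodedata.category(ch) in {"Mn", "Mc", "Me"}

# Sample code points chosen for illustration only.
samples = [
    "\u0301",  # COMBINING ACUTE ACCENT (nonspacing mark)
    "\u09C0",  # BENGALI VOWEL SIGN II (spacing mark)
    "\u20DD",  # COMBINING ENCLOSING CIRCLE (enclosing mark)
    "\u0995",  # BENGALI LETTER KA (a letter, not a mark)
]

for ch in samples:
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
        f"category={unicodedata.category(ch)}, combining={is_combining(ch)}"
    )
```

    Note that the standard library does not expose the Grapheme_Extend property, so this only covers the combining-character side of the comparison.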

    Grapheme extenders were introduced in Unicode 3.2 to be used in Unicode Technical Report #29: Text Boundaries (then in a proposed status, now known as Unicode Standard Annex #29: Unicode Text Segmentation). The main use is to define grapheme clusters. Grapheme clusters are basically user-perceived characters. According to UAX #29:

    Grapheme cluster boundaries are important for collation, regular expressions, UI interactions (such as mouse selection, arrow key movement, backspacing), segmentation for vertical text, identification of boundaries for first-letter styling, and counting “character” positions within text.

    The main difference is that grapheme extenders don't include most of the spacing marks (the set is actually smaller than the set of combining characters). Most of the spacing marks are vowel signs for Asian scripts. In these scripts, vowels are sometimes written by modifying a consonant character. If this modification takes up horizontal space (a spacing mark), it used to be treated as a separate user-perceived character, forming a new (legacy) grapheme cluster. Later versions of UAX #29 changed this by introducing extended grapheme clusters, in which most, but not all, spacing marks attach to the preceding cluster instead of starting a new one.

    I think the key sentence from the standard is: "A grapheme extender can be conceived of primarily as the kind of nonspacing graphical mark that is applied above or below another spacing character." Combining characters, on the other hand, also include spacing marks that are applied to the left or right. There are a few exceptions, though (see property Other_Grapheme_Extend).

    Example

    U+0995 BENGALI LETTER KA: ক

    U+09C0 BENGALI VOWEL SIGN II (a combining character, but not a grapheme extender): ী

    Combination of the two:

    কী

    This is a single combining character sequence consisting of two legacy grapheme clusters. The vowel sign can't be used by itself, but it still counts as a legacy grapheme cluster. A text editor, for example, could allow the cursor to be placed between the two characters.
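    The difference can be sketched in Python, under a loud assumption: the standard library has no Grapheme_Extend property, so the code below approximates it as "General Category Mn or Me" and ignores the Other_Grapheme_Extend exceptions. The grouping function is also a simplification of my own and not the UAX #29 algorithm (no Hangul, CR LF or ZWJ rules). All function names are hypothetical.

```python
import unicodedata

def is_combining(ch: str) -> bool:
    """True if the General Category is a Mark (Mn, Mc or Me)."""
    return unicodedata.category(ch) in {"Mn", "Mc", "Me"}

def approx_grapheme_extend(ch: str) -> bool:
    """ASSUMPTION: rough stand-in for Grapheme_Extend (Mn and Me only).

    The real property also contains the Other_Grapheme_Extend exceptions,
    which this sketch deliberately leaves out.
    """
    return unicodedata.category(ch) in {"Mn", "Me"}

def group(text: str, extends) -> list[str]:
    """Attach each character for which extends(ch) is true to the previous group."""
    groups: list[str] = []
    for ch in text:
        if groups and extends(ch):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups

text = "\u0995\u09C0"  # BENGALI LETTER KA + BENGALI VOWEL SIGN II

sequences = group(text, is_combining)                 # combining character sequences
legacy_clusters = group(text, approx_grapheme_extend)  # approximate legacy clusters

print(len(sequences), sequences)
print(len(legacy_clusters), legacy_clusters)
```

    For this input the first grouping yields one group and the second yields two, because the vowel sign is a spacing mark (Mc). For something like "e" followed by U+0301 COMBINING ACUTE ACCENT, both groupings give a single group. For real segmentation, a library that implements UAX #29 is the better choice.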

    There are over 300 combining characters like this which do not extend graphemes, and four characters which are not combining but do extend graphemes.