
Will normalizing a string give the same result as normalizing the individual grapheme clusters?


Would the result of performing Unicode normalization on a string (assuming no isolated combining characters) be the same as the result of splitting the string into grapheme clusters, normalizing each cluster individually, and then concatenating the normalized clusters? (If so, does this only apply to a subset of the normalization forms?)

Asking this mainly out of interest in how Unicode works and figuring out what potential edge cases there might be rather than as part of a concrete application.


Solution

  • No, that generally is not true. The Unicode Standard warns against the assumption that concatenating normalised strings produces another normalised string. From UAX #15:

    In using normalization functions, it is important to realize that none of the Normalization Forms are closed under string concatenation. That is, even if two strings X and Y are normalized, their string concatenation X+Y is not guaranteed to be normalized.
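    The non-closure property quoted above can be checked with Python's standard `unicodedata` module. This is only a minimal sketch: the helper `is_normalized` and the sample strings are chosen here for illustration, and the NFD case uses an isolated combining mark, which the question explicitly sets aside.

```python
import unicodedata

def is_normalized(form: str, s: str) -> bool:
    """Illustrative helper: True if s is already in the given normalization form."""
    return unicodedata.normalize(form, s) == s

# NFC: two Hangul jamo (leading consonant, vowel), each normalised on its own.
x, y = "\u1100", "\u1161"
assert is_normalized("NFC", x) and is_normalized("NFC", y)
# Their concatenation is not NFC: the pair composes into the syllable U+AC00.
nfc_joined = unicodedata.normalize("NFC", x + y)

# NFD: "a" + acute (ccc 230) and a lone dot below (ccc 220), each normalised.
p, q = "a\u0301", "\u0323"
assert is_normalized("NFD", p) and is_normalized("NFD", q)
# Concatenated, canonical ordering moves the dot below in front of the acute.
nfd_joined = unicodedata.normalize("NFD", p + q)

print([f"U+{ord(c):04X}" for c in nfc_joined])
print([f"U+{ord(c):04X}" for c in nfd_joined])
```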

    Many aspects of the Unicode text segmentation algorithm are tailorable; the standard mostly just provides default values that are useful in most contexts, but can be overridden when necessary for a certain purpose. Therefore, there is no guarantee that two Unicode-compliant applications are even going to agree on where grapheme boundaries are situated. A concrete example is the difference between legacy grapheme clusters and extended grapheme clusters.

    In the former, characters with the Grapheme_Cluster_Break property values SpacingMark or Prepend do not act as grapheme extenders, while in the latter they do. As of Unicode 12.1, there are twelve such characters with a non-zero canonical combining class. These characters would break your method if you used the legacy grapheme cluster definition, such as in the following sequence:

    <U+1D158, U+1D16D, U+1D166>

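    which is a black notehead (U+1D158) followed by a combining augmentation dot (U+1D16D) and a combining sprechgesang stem (U+1D166).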

    Because both the combining augmentation dot and the combining sprechgesang stem are SpacingMark, this sequence is actually divided into three legacy grapheme clusters, each only one character in length and thus automatically normalised. Normalising the entire string, however, would switch the positions of the dot and the stem because of their CCC values: the stem has class 216 and the dot has class 226, so canonical ordering places the stem first.
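    The reordering can be reproduced with Python's standard `unicodedata` module. The sketch below does not run a segmentation algorithm: the three one-character legacy clusters are hard-coded from the description above, which is an assumption of the example, and the variable names are illustrative.

```python
import unicodedata

NOTEHEAD = "\U0001D158"  # MUSICAL SYMBOL NOTEHEAD BLACK
DOT = "\U0001D16D"       # MUSICAL SYMBOL COMBINING AUGMENTATION DOT
STEM = "\U0001D166"      # MUSICAL SYMBOL COMBINING SPRECHGESANG STEM
text = NOTEHEAD + DOT + STEM

# Canonical combining class of each character in the sequence.
ccc = {f"U+{ord(c):04X}": unicodedata.combining(c) for c in text}

# Assumption: legacy clusters are hard-coded as one character each,
# following the description in the answer (no segmenter is run here).
legacy_clusters = [NOTEHEAD, DOT, STEM]

# Normalise each cluster on its own, then concatenate.
per_cluster = "".join(unicodedata.normalize("NFD", c) for c in legacy_clusters)
# Normalise the whole string at once.
whole = unicodedata.normalize("NFD", text)

print(ccc)
print(per_cluster == whole)
```

    Normalising cluster by cluster leaves the original order untouched, while normalising the whole string puts the stem before the dot, so the two results differ.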

    If we ignore the possibility of tailoring the algorithm and consider only extended grapheme clusters exactly as the standard defines them, then, to the best of my knowledge, normalising each grapheme cluster individually should produce the same result as normalising the whole string at once. There is no formal guarantee, however, that future revisions of the standard won’t change that.