utf-8 unicode-normalization canonicalization canonical-form

What is the longest UTF8 representation of an NFC-form string of a given length?


Context.

I'm writing C code against the iCalendar (RFC 5545) spec. It specifies the maximum length of a delimited line to be 75 octets, excluding the delimiter. Both the robustness principle and the W3C character model incline me to canonicalize UTF-8 input strings to NFC (see Unicode Normalization Forms).

When reading input lines, I'd like to read into a statically allocated buffer. But the UTF-8 representation of an input line might be longer than 75 octets even when its NFC form fits within 75. So this buffer will need to be larger than 75 octets. My question is how much larger.

Question.

What is the maximum length in octets of a UTF-8 string whose NFC form is at most 75 octets? (Bonus points: whose NFC form is at most N octets.)

Also, is this guaranteed and permanent, or is it an unspecified consequence of the current Unicode version and subject to change?


Solution

  • Here's some JavaScript code that tries to find the Unicode code point whose UTF-8 representation shrinks the most when converted to NFD and back to NFC. It seems that no code point shrinks by more than a factor of three. As far as I understand the Unicode normalization algorithm, only single code points have to be checked this way.

    I think that, at least theoretically, this could change in future versions of Unicode. But there's a stability policy regarding expansion of strings when normalizing to NFC (also see Can Unicode NFC normalization increase the length of a string?), so I think it's highly unlikely that this will ever change:

    Canonical mappings (Decomposition_Mapping property values) are always limited so that no string when normalized to NFC expands to more than 3× in length (measured in code units).

    So allocating an initial buffer three times as large as your maximum line length (225 octets for a 75-octet line) seems like a reasonable choice.

    // Only ratios of at least 2 are recorded; raised when a larger one is found.
    let maxRatio = 2;
    let codePoints = [];
    const encoder = new TextEncoder();
    
    for (let i = 0; i < 0x110000; i++) {
      // Exclude surrogates
      if (i >= 0xD800 && i <= 0xDFFF) continue;
      const nfd = String.fromCodePoint(i).normalize('NFD');
      const nfc = nfd.normalize('NFC');
      // Compare the UTF-8 lengths in octets
      const ratio = encoder.encode(nfd).length / encoder.encode(nfc).length;
      if (ratio > maxRatio) {
        maxRatio = ratio;
        codePoints = [ i ];
      }
      else if (ratio === maxRatio) {
        codePoints.push(i);
      }
    }
    
    console.log(`Max ratio: ${maxRatio}`);
    
    for (const codePoint of codePoints) {
      // Exclude Hangul syllables
      if (codePoint >= 0xAC00 && codePoint <= 0xD7AF) continue;
      const nfd = String.fromCodePoint(codePoint).normalize('NFD');
      const nfc = nfd.normalize('NFC');
      // Print the code point and the percent-encoded UTF-8 of both forms
      console.log(
        codePoint.toString(16).toUpperCase(),
        encodeURIComponent(nfd),
        encodeURIComponent(nfc)
      );
    }
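
The buffer arithmetic can be sketched as follows, assuming the factor of three found by the search above holds. The names `MAX_NFC_SHRINK`, `utf8Length` and `maxInputOctets` are made up for illustration and are not part of any library. The sketch is in JavaScript to match the snippet above; the same multiplication would carry over to a C `#define`. The example line uses Hangul syllables, which appear to be a case where the bound is actually reached.

```javascript
// Assumption: 3 is the worst-case UTF-8 shrink factor under NFC,
// as suggested by the search above for current Unicode data.
const MAX_NFC_SHRINK = 3;

const encoder = new TextEncoder();

// UTF-8 length of a string in octets.
function utf8Length(s) {
  return encoder.encode(s).length;
}

// Hypothetical helper: number of content octets (delimiter not included)
// sufficient for any input whose NFC form is at most nfcOctets long.
function maxInputOctets(nfcOctets) {
  return MAX_NFC_SHRINK * nfcOctets;
}

// 25 Hangul syllables (U+AC01): 3 octets each in NFC,
// 9 octets each once decomposed into three jamo in NFD.
const nfcLine = '\uAC01'.repeat(25).normalize('NFC');
const nfdLine = nfcLine.normalize('NFD');

console.log(utf8Length(nfcLine), utf8Length(nfdLine), maxInputOctets(75));
// prints: 75 225 225
```

For the bonus question, `maxInputOctets(N)` gives 3 × N under the same assumption. Room for the line delimiter and any terminator has to be added separately.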