
Can `toLowerCase` change a JavaScript string's length?


Is it ever possible that `string.length !== string.toLowerCase().length`?

I know it is possible for `toUpperCase`, given the answers at "Change to .length with toUpperCase?", but I don't know whether it's possible for `toLowerCase`.


Solution

  • Yes, it's possible. Not all case mappings are one-to-one; they can also be one-to-many, as you've seen when going from lowercase to uppercase. The same can apply when going from uppercase to lowercase. One example is LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130), i.e. İ:

    const char = "\u0130"; // İ
    const charLower = char.toLowerCase();
    console.log(char, char.length); // İ 1
    console.log(charLower, charLower.length); // i̇ 2
    console.log(char.length !== charLower.length); // true
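
    To see which other code points behave this way in a given engine, a brute-force scan is one option. The following is only a sketch: `findLengthChangingCodePoints` is a hypothetical helper name, not a built-in, and the results depend on the Unicode version the engine ships with.

    ```javascript
    // Collect every code point whose case mapping changes the number of
    // UTF-16 code units. `transform` is any string-to-string function.
    function findLengthChangingCodePoints(transform) {
      const hits = [];
      for (let cp = 0; cp <= 0x10ffff; cp++) {
        const original = String.fromCodePoint(cp);
        const mapped = transform(original);
        if (mapped.length !== original.length) {
          hits.push({
            codePoint: "U+" + cp.toString(16).toUpperCase().padStart(4, "0"),
            from: original.length, // length before the mapping
            to: mapped.length, // length after the mapping
          });
        }
      }
      return hits;
    }

    const lowerHits = findLengthChangingCodePoints((s) => s.toLowerCase());
    const upperHits = findLengthChangingCodePoints((s) => s.toUpperCase());

    console.log(lowerHits); // should include U+0130
    console.log(upperHits.length); // number of code points that change length under toUpperCase
    ```

    Note that this only tests isolated code points with the locale-independent methods, so it will not reveal mappings that depend on context or locale.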

    The ECMAScript specification also highlights this behavior in a note at section 22.1.3.26:

    The case mapping of some code points may produce multiple code points. In this case the result String may not be the same length as the source String.

    A list of special case mappings (i.e. case mappings that aren't necessarily one-to-one) can be found in SpecialCasing.txt in the Unicode Character Database (UCD). As listed there, some characters only grow in length under certain conditions, which depend on the surrounding context and the locale:

    // Another example from SpecialCasing.txt in the UCD:
    // 012E; 012F 0307; 012E; 012E; lt More_Above; # LATIN CAPITAL LETTER I WITH OGONEK
    // This means U+012E lowercases to U+012F U+0307 if:
    // - the locale is Lithuanian (lt)
    // - it is followed by a character of combining class 230 (Above)
    // U+0300 is a character with that combining class (list found here: https://www.compart.com/en/unicode/combining/230)
    
    const grapheme = '\u012E\u0300'; // Į̀ (Į +  ̀ )
    console.log(grapheme, grapheme.length); // Į̀ 2
    
    const lowerStd = grapheme.toLowerCase();
    console.log(lowerStd, lowerStd.length); // į̀ 2 (still fine)
    
    const lowerLocale = grapheme.toLocaleLowerCase('lt');
    console.log(lowerLocale, lowerLocale.length); // į̇̀ 3 (now 3 when using lt as the locale)
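
    A practical consequence is that an index into the lowercased string does not necessarily point at the same place in the original. If you need to translate positions back, one possible approach is to lowercase code point by code point and record where each output unit came from. This is a minimal sketch with a hypothetical helper, `lowerWithOffsets`; because it maps each code point in isolation, it ignores context-sensitive and locale-specific rules, so its output can differ from calling `toLowerCase()` on the whole string.

    ```javascript
    // Lowercase one code point at a time. For every UTF-16 code unit in the
    // result, remember the index in `source` where its code point started.
    function lowerWithOffsets(source) {
      let lowered = "";
      const offsets = [];
      let index = 0;
      for (const ch of source) { // for...of iterates by code point
        const mapped = ch.toLowerCase();
        for (let i = 0; i < mapped.length; i++) {
          offsets.push(index);
        }
        lowered += mapped;
        index += ch.length; // 1 or 2 code units in the source
      }
      return { lowered, offsets };
    }

    const { lowered, offsets } = lowerWithOffsets("A\u0130B");
    console.log(lowered.length); // 4
    console.log(offsets); // [ 0, 1, 1, 2 ]
    ```

    Here both code units produced by İ map back to index 1 in the source, so a match found in `lowered` can be translated to a position in the original string.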