javascript internationalization collation

Is the lack of orthographic variant support for Scandinavian languages in the JavaScript Intl API a known limitation?


Official spelling reforms in the Scandinavian languages in the 19th and 20th centuries replaced digraphs (two-letter combinations) with single, distinct letters: aa with å, ae with æ or ä, and oe with ø or ö.

In a search context, users absolutely expect these forms to be treated as equivalent. However, in the JavaScript Intl API, only å = aa is treated as equal. The others (æ = ae, ø = oe, ä = ae, ö = oe) are not.

Is this a known limitation of the JavaScript Intl API or the underlying ICU implementation, or am I missing a configuration option?

The code snippet below demonstrates the problem. A result of 0 indicates the collator treats the two strings as equivalent, which is what we expect for all these cases.

const options = { usage: 'search', sensitivity: 'base' };
const daCollator = new Intl.Collator('da', options);
const svCollator = new Intl.Collator('sv', options);

const results = [
  daCollator.compare('å', 'aa'), // 0 ✅ expected
  daCollator.compare('æ', 'ae'), // 1 ❌ unexpected
  daCollator.compare('ø', 'oe'), // 1 ❌ unexpected
  svCollator.compare('ä', 'ae'), // 1 ❌ unexpected
  svCollator.compare('ö', 'oe'), // 1 ❌ unexpected
];
const expected = [0, 0, 0, 0, 0];

console.log('Results:', results);
console.log('Expected:', expected);
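
Before blaming the collation data, it may be worth ruling out a silent locale fallback in the runtime. The sketch below is only a diagnostic under that assumption: it uses the standard `Intl.Collator.supportedLocalesOf` and `resolvedOptions` calls, while `describeCollator` and `requested` are names made up for this example.

```javascript
// Diagnostic sketch: confirm the runtime actually resolved the requested
// locales and options instead of silently falling back to a default.
const requested = { usage: 'search', sensitivity: 'base' };

// Hypothetical helper: reports what the engine resolved for one locale.
function describeCollator(locale) {
  const resolved = new Intl.Collator(locale, requested).resolvedOptions();
  return {
    requested: locale,
    resolvedLocale: resolved.locale,
    usage: resolved.usage,
    sensitivity: resolved.sensitivity,
  };
}

// Lists which of the requested locales the runtime has collation data for.
console.log(Intl.Collator.supportedLocalesOf(['da', 'sv']));
console.log(describeCollator('da'));
console.log(describeCollator('sv'));
```

If the resolved locale is not Danish or Swedish, the comparison results above say more about the runtime's locale data than about the tailoring itself.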

Solution

  • The support for å = aa in Danish is clearly documented in the UCA Specification:

    For example, at a primary strength, "ß" would match against "ss" according to the UCA, and "aa" would match "å" in a Danish tailoring of the UCA.

    It is also documented in the ICU User Guide:

    For example, in Danish, ‘å’ (\u00e5) and ‘aa’ are considered equivalent.

    å = aa is historically the most entrenched and universally accepted equivalence. The CLDR maintainers are probably conservative about adding further tailorings, which may explain why the other equivalences are not implemented.
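
If the other equivalences are needed regardless, one possible workaround is to expand the single letters to their digraphs before comparing. This is a minimal sketch under that assumption, not a feature of the Intl API or ICU: `DIGRAPH_FOLDS`, `foldDigraphs`, and `foldedCompare` are hypothetical names, and the mapping simply mirrors the pairs listed in the question.

```javascript
// Hypothetical mapping (not part of Intl): each single letter is expanded
// to the digraph the question wants it to match.
const DIGRAPH_FOLDS = { 'æ': 'ae', 'ø': 'oe', 'ä': 'ae', 'ö': 'oe', 'å': 'aa' };

function foldDigraphs(text) {
  // Compose and lowercase first so 'Æ' and decomposed 'a\u0308' are folded too.
  return text
    .normalize('NFC')
    .toLowerCase()
    .replace(/[æøäöå]/g, (letter) => DIGRAPH_FOLDS[letter]);
}

const baseCollator = new Intl.Collator('da', { usage: 'search', sensitivity: 'base' });

// Compares two strings after folding both sides to the digraph spelling.
function foldedCompare(a, b) {
  return baseCollator.compare(foldDigraphs(a), foldDigraphs(b));
}

console.log(foldedCompare('æ', 'ae')); // 0
console.log(foldedCompare('ø', 'oe')); // 0
console.log(foldedCompare('ä', 'ae')); // 0
console.log(foldedCompare('ö', 'oe')); // 0
```

The trade-off is over-matching: every "ae" or "oe" sequence is treated as if it were the single letter, and the fold ignores which language the text is in, so this is probably only acceptable for lenient search rather than for sorting.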