python sorting locale icu

icu: Sort strings based on 2 different locales


As you probably know, alphabetical order in some (maybe most) languages differs from the order of their characters in Unicode. That's why we may want to use icu.Collator to sort, as in this Python example:

from icu import Collator, Locale
collator = Collator.createInstance(Locale("fa_IR.UTF-8"))
mylist.sort(key=collator.getSortKey)

This works perfectly for Persian strings. But it also sorts all Persian strings before all ASCII / English strings (which is the opposite of Unicode sort).

What if we want to sort ASCII before this given locale?

Or ideally, I want to sort by two or more locales (for example, by giving multiple Locale arguments to Collator.createInstance).

If we could tell collator.getSortKey to return empty bytes for other locales, then I could create a tuple of 2 collator.getSortKey() results, for example:

from icu import Collator, Locale

collator1 = Collator.createInstance(Locale("en_US.UTF-8"))
collator2 = Collator.createInstance(Locale("fa_IR.UTF-8"))

def sortKey(s):
    return collator1.getSortKey(s), collator2.getSortKey(s)

mylist.sort(key=sortKey)

But it looks like getSortKey always returns non-empty bytes.


Solution

  • A bit late to answer the question, but here it is for future reference.

    ICU collation uses the CLDR Collation Algorithm, which is a tailoring of the Unicode Collation Algorithm. The default collation is referred to as the root collation. Rather than thinking of each locale as having its own complete set of collation rules, think of a locale as specifying only the differences between the rules it needs and the root collation. CLDR takes a minimalist approach: a locale includes only the minimal set of differences from the root collation.

    English uses the root collation, with no tailorings. Persian, on the other hand, needs a few rules to override certain aspects of the root collation.

    As the question indicates, the Persian collation rules order Arabic characters before Latin characters. In the collation rule set for Persian there is a rule [reorder Arab]. This rule is what you need to override.

    There are a few ways to do this:

    1. Use icu.RuleBasedCollator with a custom set of rules for Persian.
    2. Create a standard Persian collator, retrieve its rules, strip out the reorder directive, and then use the modified rules with icu.RuleBasedCollator.
    3. Create a collator instance using a BCP-47 language tag instead of a Locale identifier.

    There are other approaches as well, but the third is the simplest:

    from icu import Collator, Locale

    # "kr" is the collation reorder key: Latin first, then Arabic
    loc = Locale.forLanguageTag("fa-u-kr-latn-arab")
    collator = Collator.createInstance(loc)
    sorted(mylist, key=collator.getSortKey)
    

    This will reorder the Persian collation rules, placing Latin script before Arabic script, then everything else afterwards.
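
    For comparison, the second option in the list above (retrieve the rules, strip the reorder directive, rebuild the collator) could be sketched roughly as follows. This is only a sketch under assumptions: strip_reorder and persian_collator_without_reorder are hypothetical helper names, the regular expression assumes the directive appears in the form [reorder ...], and it assumes PyICU returns a collator for "fa" that exposes getRules().

```python
import re

# Matches a reorder directive such as "[reorder Arab]" or "[reorder Latn Arab]".
# Assumption: the directive is written in this bracketed form in the rule string.
REORDER_DIRECTIVE = re.compile(r"\[\s*reorder\b[^\]]*\]")


def strip_reorder(rules: str) -> str:
    """Return the collation rules with any [reorder ...] directive removed."""
    return REORDER_DIRECTIVE.sub("", rules)


def persian_collator_without_reorder():
    """Build a Persian collator from the standard rules minus the reorder.

    Assumes PyICU is installed and that the collator created for "fa"
    provides getRules().
    """
    import icu  # imported lazily so strip_reorder() can be used without PyICU

    standard = icu.Collator.createInstance(icu.Locale("fa"))
    return icu.RuleBasedCollator(strip_reorder(standard.getRules()))
```

    With the directive removed, the scripts should fall back to the root collation's script order while the remaining Persian tailorings stay in place.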

    Update 2024-06-27

    The reordering directive above reorders Latin first, then Arabic script, then everything else based on its default ordering.

    This works well for bilingual data in Persian and languages using the Latin script, but may not be as suitable for multiscript data.

    There is a special ISO 15924 code, Zzzz, representing Unknown script. As an ICU reorder code, it stands for all scripts not explicitly listed in the reorder. So fa-u-kr-latn-arab is the same as fa-u-kr-latn-arab-Zzzz, but if we use fa-u-kr-Zzzz without mentioning other codes, the collator orders scripts as per the root collation order. This gives us Persian-specific sorting combined with the default script order of the root collation:

    import icu
    data = ["Salâm", "سلام", "тасли́м", "Persian", "فارسی", "Персидский язык"]
    
    # Persian (Farsi) locale based collator
    loc_fa = icu.Locale('fa')
    collator_fa = icu.Collator.createInstance(loc_fa)
    sorted(data, key=collator_fa.getSortKey)
    # ['سلام', 'فارسی', 'Persian', 'Salâm', 'Персидский язык', 'тасли́м']
    
    
    # Persian (Farsi) locale based collator with reordering: Latin, Arabic, then other scripts
    loc_alt = icu.Locale.forLanguageTag("fa-u-kr-latn-arab")
    collator_alt = icu.Collator.createInstance(loc_alt)
    sorted(data, key=collator_alt.getSortKey)
    # ['Persian', 'Salâm', 'سلام', 'فارسی', 'Персидский язык', 'тасли́м']
    
    # Persian (Farsi) locale based collator with reordering:  Other (Zzzz - Unknown script)
    # Sets order to default CLDR order
    loc = icu.Locale.forLanguageTag("fa-u-kr-Zzzz")
    collator = icu.Collator.createInstance(loc)
    sorted(data, key=collator.getSortKey)
    # ['Persian', 'Salâm', 'Персидский язык', 'тасли́м', 'سلام', 'فارسی']
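
    The tags used above all follow the same pattern: a language subtag, then -u-kr-, then the script codes in the desired order. A small sketch of building such tags programmatically is shown below. The names reorder_tag and sort_with_reorder are hypothetical helpers, not part of PyICU, and sort_with_reorder assumes PyICU is installed.

```python
def reorder_tag(language: str, *scripts: str) -> str:
    """Build a BCP-47 tag with a -u-kr- (collation reorder) extension.

    With no script codes, the plain language tag is returned unchanged.
    """
    if not scripts:
        return language
    return f"{language}-u-kr-" + "-".join(scripts)


def sort_with_reorder(data, language, *scripts):
    """Sort data with a collator built from the reordered tag.

    Assumes PyICU is installed; imported lazily so reorder_tag()
    can be used on its own.
    """
    import icu

    loc = icu.Locale.forLanguageTag(reorder_tag(language, *scripts))
    collator = icu.Collator.createInstance(loc)
    return sorted(data, key=collator.getSortKey)
```

    For example, reorder_tag("fa", "latn", "arab") gives "fa-u-kr-latn-arab", and reorder_tag("fa", "Zzzz") gives "fa-u-kr-Zzzz".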