As you probably know, alphabetical order in some (maybe most) languages differs from the code point order of their letters in Unicode. That's why we may want to use icu.Collator to sort, as in this Python example:
from icu import Collator, Locale
collator = Collator.createInstance(Locale("fa_IR.UTF-8"))
mylist.sort(key=collator.getSortKey)
This works perfectly for Persian strings. But it also sorts all Persian strings before all ASCII / English strings (which is the opposite of Unicode sort).
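To make the contrast with plain code point order concrete, here is a small sketch that uses only the standard library. The sample strings are illustrative and not taken from real data.

```python
# Python compares str values code point by code point.
# Persian alphabet order for these three letters is: U+0628, U+067E, U+062A.
letters = ["\u067e", "\u062a", "\u0628"]
by_code_point = sorted(letters)

# U+067E lies after U+062A in Unicode, so it ends up last,
# which is not the Persian alphabetical order.
print(by_code_point)

# Mixed scripts: Basic Latin (U+0000-U+007F) precedes the Arabic block
# (U+0600-U+06FF), so the Latin string comes first.
mixed = sorted(["سلام", "Persian"])
print(mixed)
```

This is the behaviour that a collator replaces with language-aware ordering.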
What if we want to sort ASCII before this given locale?
Or ideally, I want to sort by two or more locales (for example, by giving multiple Locale arguments to Collator.createInstance).
If we could tell collator.getSortKey to return empty bytes for other locales, then I could create a tuple of two collator.getSortKey() results, for example:
from icu import Collator, Locale
collator1 = Collator.createInstance(Locale("en_US.UTF-8"))
collator2 = Collator.createInstance(Locale("fa_IR.UTF-8"))
def sortKey(s):
    return collator1.getSortKey(s), collator2.getSortKey(s)
mylist.sort(key=sortKey)
But it looks like getSortKey always returns non-empty bytes.
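A tuple key can still work if its first element is a coarse bucket rather than a second sort key. The sketch below is one possible shape, not an ICU feature: ascii_first_key is a hypothetical helper, and str.casefold stands in for collator.getSortKey so that the example runs without PyICU.

```python
def ascii_first_key(inner_key):
    """Wrap a key function so that pure-ASCII strings sort before all others.

    inner_key is any key function. With PyICU it could be
    collator.getSortKey (an assumption of this sketch, not exercised here).
    """
    def key(s):
        bucket = 0 if s.isascii() else 1  # bucket 0 sorts before bucket 1
        return (bucket, inner_key(s))
    return key


# Stand-in inner key so the sketch runs with the standard library only.
words = ["سلام", "Persian", "فارسی", "apple"]
result = sorted(words, key=ascii_first_key(str.casefold))
print(result)
```

Note that the bucket test is deliberately crude: a Latin-script string with a non-ASCII letter, such as "Salâm", would land in the second bucket.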
A bit late to answer the question, but here it is for future reference.
ICU collation uses the CLDR Collation Algorithm, which is a tailoring of the Unicode Collation Algorithm. The default collation is referred to as the root collation. Don't think of each locale as having its own complete set of collation rules; think of a locale as specifying only the differences between the rules it needs and the root collation. CLDR takes a minimalist approach: a locale includes only the minimal set of differences from the root collation.
English uses the root collation, with no tailorings. Persian, on the other hand, has a few rules that override certain aspects of the root collation.
As the question indicates, the Persian collation rules order Arabic characters before Latin characters. The collation rule set for Persian contains the rule [reorder Arab]. This rule is what you need to override.
There are a few ways to do this. Two of them go through icu.RuleBasedCollator, for example with a custom set of rules for Persian. The third sets the script reordering in the locale's language tag. There are other approaches as well, but the third is the simplest:
from icu import Collator, Locale
loc = Locale.forLanguageTag("fa-u-kr-latn-arab")
collator = Collator.createInstance(loc)
sorted(mylist, key=collator.getSortKey)
This will reorder the Persian collation rules, placing Latin script before Arabic script, then everything else afterwards.
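If the script order needs to vary, the language tag can be assembled from a list of script codes. This is a minimal sketch: reorder_tag and reordered_collator are hypothetical helper names, and only the tag-building part is exercised here, since the collator part needs PyICU installed.

```python
def reorder_tag(language, scripts):
    """Build a BCP 47 tag such as 'fa-u-kr-latn-arab' from script codes."""
    if not scripts:
        return language
    # BCP 47 tags are case-insensitive; lowercase is used for consistency.
    return language + "-u-kr-" + "-".join(code.lower() for code in scripts)


def reordered_collator(language, scripts):
    """Create a collator for the tag. Requires PyICU at call time."""
    import icu  # imported lazily so reorder_tag works without PyICU
    locale = icu.Locale.forLanguageTag(reorder_tag(language, scripts))
    return icu.Collator.createInstance(locale)


tag = reorder_tag("fa", ["Latn", "Arab"])
print(tag)
```

With PyICU available, sorted(mylist, key=reordered_collator("fa", ["Latn", "Arab"]).getSortKey) should then behave like the snippet above.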
Update 2024-06-27
The reordering directive above places Latin script first, then Arabic script, then everything else in its default order.
This works well for bilingual data in Persian and languages using the Latin script, but may not be as suitable for multiscript data.
There is a special ISO 15924 code, Zzzz, representing Unknown script. As an ICU reorder code, it stands for all scripts not explicitly named in the reordering. So fa-u-kr-latn-arab is the same as fa-u-kr-latn-arab-Zzzz, but if we use fa-u-kr-Zzzz without mentioning other codes, the collator orders scripts as per the root collation. This gives us Persian-specific sorting combined with the default script order of the root collation:
import icu
data = ["Salâm", "سلام", "тасли́м", "Persian", "فارسی", "Персидский язык"]
# Persian (Farsi) locale based collator
loc_fa = icu.Locale('fa')
collator_fa = icu.Collator.createInstance(loc_fa)
sorted(data, key=collator_fa.getSortKey)
# ['سلام', 'فارسی', 'Persian', 'Salâm', 'Персидский язык', 'тасли́м']
# Persian (Farsi) locale based collator with reordering: Latin, Arabic, then other scripts
loc_alt = icu.Locale.forLanguageTag("fa-u-kr-latn-arab")
collator_alt = icu.Collator.createInstance(loc_alt)
sorted(data, key=collator_alt.getSortKey)
# ['Persian', 'Salâm', 'سلام', 'فارسی', 'Персидский язык', 'тасли́м']
# Persian (Farsi) locale based collator with reordering: Other (Zzzz - Unknown script)
# Sets order to default CLDR order
loc = icu.Locale.forLanguageTag("fa-u-kr-Zzzz")
collator = icu.Collator.createInstance(loc)
sorted(data, key=collator.getSortKey)
# ['Persian', 'Salâm', 'Персидский язык', 'тасли́м', 'سلام', 'فارسی']
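Which of the two reorder tags fits depends on how many scripts the data contains. The sketch below is a rough heuristic for that choice, using only the standard library. scripts_in and suggest_tag are hypothetical helpers, and taking the first word of the Unicode character name is an approximation, not the real Unicode Script property.

```python
import unicodedata


def scripts_in(strings):
    """Return rough script names for the letters found in strings.

    Uses the first word of each character name (e.g. 'LATIN', 'ARABIC',
    'CYRILLIC'). This is a heuristic, not the Unicode Script property.
    """
    found = set()
    for s in strings:
        for ch in s:
            if ch.isalpha():  # skips spaces, digits and combining marks
                found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found


def suggest_tag(strings):
    """Pick one of the two reorder tags discussed above."""
    if scripts_in(strings) <= {"LATIN", "ARABIC"}:
        return "fa-u-kr-latn-arab"  # bilingual data: Latin first, then Arabic
    return "fa-u-kr-Zzzz"           # multiscript data: root script order


data = ["Salâm", "سلام", "тасли́м", "Persian", "فارسی", "Персидский язык"]
print(sorted(scripts_in(data)))
print(suggest_tag(data))
```

The returned tag can be passed to icu.Locale.forLanguageTag as in the examples above.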