
What is the theory behind Unicode collation sorting?


What is the theory behind Unicode sorting? I understand how it works, but I don't understand why this standard was chosen for collation.

For example, suppose you compare two strings with ucol_strcollIter():

ucol_strcollIter(collator, &stringIter1, &stringIter2, &Status)

Then, say the two strings are:

string string1 = "hello"
string string2 = "héllo"

Under the "Secondary" collation strength, string1 is ordered before string2. The two strings are equal at the primary level, so they are distinguished by their secondary weights.

<1 hello
<2 héllo

BUT

If you have trailing spaces, like:

string string1 = "hello  "
string string2 = "héllo "

then the accented string (string2) is placed before string1, and the two are distinguished by their primary weights.

<1 héllo  
<1 hello 

Why does the Unicode collation algorithm take trailing spaces into account?

Is there some reason behind this?


Solution

  • Probably the best TP would be this.

    You can try various option combinations with the ICU Collation Demo (give "alternate=shifted" a try).
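
To make the effect of "alternate=shifted" concrete, here is a minimal toy model in plain C. It is only a sketch and does not call ICU: the weight values, the Latin-1 byte 0xE9 standing in for "é", and the names toy_weight, level_key, compare_keys and toy_compare are all made up for illustration. What it does mirror is the multi-level idea: every primary weight in the string is compared before any secondary weight is looked at. In the default (non-ignorable) mode a space carries a primary weight, so the string with fewer trailing spaces wins at the primary level before the accent is ever considered. In shifted mode the space is ignored at these levels, so the accent decides. In ICU4C itself the corresponding setting is the UCOL_ALTERNATE_HANDLING attribute with the value UCOL_SHIFTED.

```c
#include <stddef.h>

#define TOY_MAX 64  /* toy limit: at most 64 weights per string are compared */

/* One collation element per character: a primary and a secondary weight. */
typedef struct { int primary; int secondary; } ToyWeight;

/* Hypothetical weights, NOT real UCA/DUCET values.
   Input is treated as Latin-1 bytes, so 0xE9 is 'é'.
   Characters not listed get weight 0, i.e. they are ignored. */
static ToyWeight toy_weight(unsigned char c, int shifted) {
    ToyWeight w = {0, 0};
    if (c == ' ') {
        /* Non-ignorable: space has a (low) primary weight.
           Shifted: space is ignored at the primary and secondary levels. */
        if (!shifted) { w.primary = 1; w.secondary = 1; }
    } else if (c == 0xE9) {
        /* 'é' shares the primary weight of 'e' but has a heavier secondary. */
        w.primary = 10 + ('e' - 'a');
        w.secondary = 2;
    } else if (c >= 'a' && c <= 'z') {
        w.primary = 10 + (c - 'a');
        w.secondary = 1;
    }
    return w;
}

/* Collect the non-zero weights of one level (1 = primary, 2 = secondary). */
static size_t level_key(const char *s, int level, int shifted, int *out) {
    size_t n = 0;
    for (; *s != '\0' && n < TOY_MAX; ++s) {
        ToyWeight w = toy_weight((unsigned char)*s, shifted);
        int v = (level == 1) ? w.primary : w.secondary;
        if (v != 0) out[n++] = v;
    }
    return n;
}

/* Lexicographic comparison; a key that is a proper prefix sorts first. */
static int compare_keys(const int *a, size_t na, const int *b, size_t nb) {
    size_t n = (na < nb) ? na : nb;
    for (size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) return (a[i] < b[i]) ? -1 : 1;
    }
    return (na > nb) - (na < nb);
}

/* Compare at "secondary strength": the whole primary level first,
   then the secondary level only if the primary level is a tie.
   Returns a negative, zero, or positive value like strcmp. */
int toy_compare(const char *a, const char *b, int shifted) {
    int ka[TOY_MAX], kb[TOY_MAX];
    for (int level = 1; level <= 2; ++level) {
        size_t na = level_key(a, level, shifted, ka);
        size_t nb = level_key(b, level, shifted, kb);
        int c = compare_keys(ka, na, kb, nb);
        if (c != 0) return c;
    }
    return 0;
}
```

With these toy weights, toy_compare("hello", "h\xE9llo", 0) is negative (a secondary difference), toy_compare("hello  ", "h\xE9llo ", 0) is positive because the extra space makes the first primary key longer, and toy_compare("hello  ", "h\xE9llo ", 1) is negative again because shifted mode drops the spaces from both levels.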