
Is there Unicode Collation Algorithm (UCA) code for Delphi?


Collation under Unicode Technical Standard #10 (the Unicode Collation Algorithm, UCA) is a separate matter from being Unicode compliant. It covers not only ordering and sorting but also comparison: the question "is string 1 equal to string 2?". Sometimes strings whose code points differ are to be considered equal for collation and comparison purposes; at least, that is implied by this blog post, which is written from the perspective of the Perl standard library.

What I want to know is: (a) does Delphi XE2 already fully implement the Unicode Collation Algorithm, and (b) if not, does a third-party library do so?

Sample code:

Str1 := Chr($212B);
Str2 := Chr($C5);
n := CompareStr(Str1, Str2); // In Delphi this is non-zero; under UCA rules it should be 0.

According to the Unicode collation spec, the two code points above should be considered equivalent under comparison. That makes no sense from a binary point of view, so I'm glad that neither CompareStr in Delphi nor cmp in Perl (from the linked article) is polluted with Unicode glitches. But what if you want to do a Unicode-compliant collation in Delphi, as the Perl Unicode::Collate library does? How?

Update: AnsiCompareStr calls the Win32 CompareString function and handles some locale-specific cases like the one above. From reading around the internet, the classic Windows Unicode collation behaviour and UCA seem to be converging slowly but not completely, with UCA apparently being the one that gets changed to look more like Windows collation.


Solution

  • (a) No. Delphi's AnsiCompareStr and co. wrap the Win32 CompareString function, which does not follow the Unicode collation algorithm.

    (b) The ICU project does support it, but the Delphi wrapper, ICU4PAS, hasn't been updated since 2007.

    That may not be necessary for you, though. You're seeing this behavior because you're using CompareStr instead of AnsiCompareStr. The non-ANSI version is written in asm in SysUtils, compares char by char, and doesn't take equivalence or combining characters into account. The case-insensitive version, CompareText, likewise only works with a-z. The ANSI versions call CompareString internally, which is locale-aware and does handle all of those cases.

    Note that this is only true for the routines in SysUtils, though. In StrUtils.pas, the non-ANSI versions are just inline wrappers around the ANSI ones, so they are all locale-aware.
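
To see the difference between the two SysUtils routines for yourself, a minimal console sketch along these lines should do. The program name is made up for illustration, and the result of AnsiCompareStr depends on the Windows version and the user locale, so no expected output is given for it.

```delphi
program CompareDemo; // hypothetical name, for illustration only

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Str1, Str2: string;
begin
  Str1 := #$212B; // U+212B ANGSTROM SIGN
  Str2 := #$00C5; // U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE

  // Ordinal comparison of the UTF-16 code units: the two values differ,
  // so the result is non-zero.
  Writeln('CompareStr:     ', CompareStr(Str1, Str2));

  // Locale-aware comparison through the Win32 CompareString function;
  // the result depends on the OS and the active locale.
  Writeln('AnsiCompareStr: ', AnsiCompareStr(Str1, Str2));
end.
```

If the second line prints 0 on your system, the locale-aware routine may already be enough for your purposes, even though it is not a UCA implementation.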