python unicode locale

Python not sorting unicode properly. Strcoll doesn't help


I've got a problem with sorting lists using Unicode collation in Python 2.5.1 and 2.6.5 on OS X, as well as on Linux.

# -*- coding: utf-8 -*-
import locale
locale.setlocale(locale.LC_ALL, 'pl_PL.UTF-8')
# Sort using the collation rules of the current locale
print sorted([u'a', u'z', u'ą'], cmp=locale.strcoll)

Which should print:

[u'a', u'ą', u'z']

But instead prints out:

[u'a', u'z', u'ą']

In short, it looks as if strcoll is broken. I have also tried it with other input types (e.g. encoded byte strings instead of unicode objects).

What am I doing wrong?

Best regards, Tomasz Kopczuk.


Solution

  • Apparently, the only way for sorting to work on all platforms is to use the ICU library with PyICU bindings (PyICU on PyPI).

    On OS X: sudo port install py26-pyicu, minding the bug described here: https://svn.macports.org/ticket/23429 (oh, the joy of using MacPorts).

    PyICU's documentation is unfortunately severely lacking, but I managed to find out how it's done:

    import PyICU
    # Build a collator for the Polish locale and use it as the comparison function
    collator = PyICU.Collator.createInstance(PyICU.Locale('pl_PL.UTF-8'))
    print sorted([u'a', u'z', u'ą'], cmp=collator.compare)
    

    which gives:

    [u'a', u'ą', u'z']
    

    Another advantage, as @bobince points out: it's thread-safe, so it remains usable when the locale is set per request.
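
For anyone on Python 3, where sorted() no longer accepts cmp=, the same idea can probably be expressed with functools.cmp_to_key. This is only a rough sketch under a few assumptions: PyICU is installed and importable as icu (the module name used by recent releases, not the PyICU name shown above), and sort_with_cmp and polish_collator_compare are helper names made up for this example, not part of PyICU.

```python
import functools

try:
    import icu  # assumption: PyICU is installed; recent releases expose it as "icu"
except ImportError:
    icu = None


def sort_with_cmp(items, compare):
    """Sort items with an old-style comparison function.

    compare(a, b) must return a negative number, zero, or a positive number,
    which is the contract that both locale.strcoll and collator.compare follow.
    """
    return sorted(items, key=functools.cmp_to_key(compare))


def polish_collator_compare():
    """Return the compare method of a Polish ICU collator, or None without PyICU.

    Hypothetical helper for this sketch only.
    """
    if icu is None:
        return None
    collator = icu.Collator.createInstance(icu.Locale("pl_PL"))
    return collator.compare


if __name__ == "__main__":
    compare = polish_collator_compare()
    if compare is None:
        print("PyICU is not installed; nothing to demonstrate")
    else:
        print(sort_with_cmp(["a", "z", "ą"], compare))
```

With PyICU available this should list the letters in the order a, ą, z, as in the output above. Because the collator is an ordinary object rather than process-wide state like locale.setlocale, each request could create or look up its own collator.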