postgresqlsql-order-bycollationalphabetical-sortaccent-sensitive

Accent sensitive sort in PostgreSQL


How to sort text in PostgreSQL in Latvian alphabetic order, so any letter without accent always comes before the letter with accent?

For example this

select * from (
    select 'bā' as name
    union select 'āa'
    union select 'ac'
    union select 'Āb'
    union select 'Ā'
) t
order by name COLLATE "lv-x-icu"

returns Ā āa Āb ac bā

but the expected result is ac Ā āa Āb bā

I have tried to create some custom collations but none of them returned the expected result.


Solution

  • This is an unusual requirement. In natural language collations (including the Latuvian collation), strings are compared in different steps:

    That explains the sort order you see: since a < aa < ab < ac <ba on the primary level, accents are not taken into account.

    If you want ac < Ā, you want accents to be considered different on the primary level, as if a and ā were different letters. For that, you'd have to use an ICU collation and customize it with additional rules. This is only possible from v16 on:

    CREATE COLLATION latuvian_custom (
       PROVIDER = icu,
       LOCALE = 'lv',
       RULES = '& a < ā & A < Ā & c < č & C < Č & e < ē & E < Ē & g < ģ & G < Ģ & k < ķ & K < Ķ & l < ļ & L < Ļ & n < ņ & N < Ņ & s < š & S < Š & u < ū & U < Ū & z < ž & Z < Ž'
    );
    
    select * from (
        select 'bā' as name
        union select 'āa'
        union select 'ac'
        union select 'Āb'
        union select 'Ā'
    ) t
    order by name COLLATE latuvian_custom;
    
     name 
    ══════
     ac
     Ā
     Āb
     āa
     bā
    (5 rows)
    

    The rules introduce a primary level difference (<) between the unaccented and the accented characters.