How to sort text in PostgreSQL in Latvian alphabetic order, so any letter without accent always comes before the letter with accent?
For example this
select * from (
select 'bā' as name
union select 'āa'
union select 'ac'
union select 'Āb'
union select 'Ā'
) t
order by name COLLATE "lv-x-icu"
returns Ā āa Āb ac bā
but the expected result is ac Ā āa Āb bā
I have tried to create some custom collations but none of them returned the expected result.
This is an unusual requirement. In natural language collations (including the Latuvian collation), strings are compared in different steps:
In the first step, punctuation, case and accents are ignored. If the strings compare differently, that determines the sorting order. This is called a difference on the primary level.
If the strings are equal on the primary level, only punctuation and case are ignored, but accents matter. If the strings compare differently, that determines the sorting order. This is a difference on the secondary level.
If the strings are equal on the secondary level, only punctuation is ignored, but accents and case matter. This is the tertiary level.
Finally, if the strings are equal on the tertiary level, differences in punctuation (the quaternary level) matter.
That explains the sort order you see: since a < aa < ab < ac <ba
on the primary level, accents are not taken into account.
If you want ac < Ā
, you want accents to be considered different on the primary level, as if a
and ā
were different letters. For that, you'd have to use an ICU collation and customize it with additional rules. This is only possible from v16 on:
CREATE COLLATION latuvian_custom (
PROVIDER = icu,
LOCALE = 'lv',
RULES = '& a < ā & A < Ā & c < č & C < Č & e < ē & E < Ē & g < ģ & G < Ģ & k < ķ & K < Ķ & l < ļ & L < Ļ & n < ņ & N < Ņ & s < š & S < Š & u < ū & U < Ū & z < ž & Z < Ž'
);
select * from (
select 'bā' as name
union select 'āa'
union select 'ac'
union select 'Āb'
union select 'Ā'
) t
order by name COLLATE latuvian_custom;
name
══════
ac
Ā
Āb
āa
bā
(5 rows)
The rules introduce a primary level difference (<
) between the unaccented and the accented characters.