postgresqlhierarchical-datarecursive-querypostgresql-12collate

How to hierarchical query in PostgreSQL with language-dependent data


I am trying to retrieve a hierarchical ordered result from a query on an auto referenced table like this:

create table category (
  id          serial,
  
  -- parent category, 
  parent_id   integer default null, -- null for root category
  
  -- tree control
  tree_depth  smallint not null, -- calculated

  primary key (id),
  unique (parent_id, id),
  foreign key (parent_id) references category (id)
);

This is a common approach to store a tree of categories except for the need for supporting multiple languages. To that aim, we join a language-dependent table like this:

create table category_lang (
  id            serial,
  
  -- natural primary key
  category_id   integer not null,
  lang_code     char(2) not null,
  
  -- language-dependent data
  title         varchar(128) not null,
  
  primary key (id),
  unique (category_id, lang_code)
);

The tree_depth column is calculated in a before insert trigger like this:

create or replace function fn_category__bins () returns trigger as $$
begin
  -- calculate tree_depth as parent tree_depth + 1
  if new.parent_id is null then
    new.tree_depth = 0;
  else
    new.tree_depth = (select tree_depth from category where id = new.parent_id limit 1) + 1;
  end if;

  return new;
end;
$$ language plpgsql;

create trigger tg_category__bins before insert on category for each row
execute procedure fn_category__bins();

We populate the tables with easy to read texts in two languages:

insert into category (parent_id, id) values 
(null, 1),
(null, 2),
(null, 3),

(1, 11),
(1, 12),
(1, 13),

(2, 21),
(2, 22),
(3, 31),

(21, 211),
(21, 212),
(21, 213);

-- lang_code = 'EN'
insert into category_lang (category_id, title, lang_code) values 
(1, 'One',   'EN'),
(2, 'Two',   'EN'),
(3, 'Three', 'EN'),

(11, 'One.One',   'EN'),
(12, 'One.Two',   'EN'),
(13, 'One.Three', 'EN'),

(21, 'Two.One',   'EN'),
(22, 'Two.Two',   'EN'),
(31, 'Three.One', 'EN'),

(211, 'Two.One.One',   'EN'),
(212, 'Two.One.Two',   'EN'),
(213, 'Two.One.Three', 'EN');

-- lang_code = 'ES'
insert into category_lang (category_id, title, lang_code) values 
(1, 'Uno',  'ES'),
(2, 'Dos',  'ES'),
(3, 'Tres', 'ES'),

(11, 'Uno.Uno',  'ES'),
(12, 'Uno.Dos',  'ES'),
(13, 'Uno.Tres', 'ES'),

(21, 'Dos.Uno',  'ES'),
(22, 'Dos.Dos',  'ES'),
(31, 'Tres.Uno', 'ES'),

(211, 'Dos.Uno.Uno',  'ES'),
(212, 'Dos.Uno.Dos',  'ES'),
(213, 'Dos.Uno.Tres', 'ES');

A plain query produces a natural result like this:

select * from category tc 
left outer join category_lang tl on tl.category_id = tc.id and tl.lang_code = 'EN';

id |parent_id|tree_depth|id|category_id|lang_code|title        |
---|---------|----------|--|-----------|---------|-------------|
  1|         |         0| 1|          1|EN       |One          |
  2|         |         0| 2|          2|EN       |Two          |
  3|         |         0| 3|          3|EN       |Three        |
 11|        1|         1| 4|         11|EN       |One.One      |
 12|        1|         1| 5|         12|EN       |One.Two      |
 13|        1|         1| 6|         13|EN       |One.Three    |
 21|        2|         1| 7|         21|EN       |Two.One      |
 22|        2|         1| 8|         22|EN       |Two.Two      |
 31|        3|         1| 9|         31|EN       |Three.One    |
211|       21|         2|10|        211|EN       |Two.One.One  |
212|       21|         2|11|        212|EN       |Two.One.Two  |
213|       21|         2|12|        213|EN       |Two.One.Three|

when the expected order should be compliant with tree hierarchy and alphabetical order in English (at every depth level), like this:
[Edited to fix the error identified by Erwin]

id |parent_id|tree_depth|id|category_id|lang_code|title        |
---|---------|----------|--|-----------|---------|-------------|
  1|         |         0| 1|          1|EN       |One          |
 11|        1|         1| 4|         11|EN       |One.One      |
 13|        1|         1| 6|         13|EN       |One.Three    |
 12|        1|         1| 5|         12|EN       |One.Two      |
  3|         |         0| 3|          3|EN       |Three        |
 31|        3|         1| 9|         31|EN       |Three.One    |
  2|         |         0| 2|          2|EN       |Two          |
 21|        2|         1| 7|         21|EN       |Two.One      |
211|       21|         2|10|        211|EN       |Two.One.One  |
213|       21|         2|12|        213|EN       |Two.One.Three|
212|       21|         2|11|        212|EN       |Two.One.Two  |
 22|        2|         1| 8|         22|EN       |Two.Two      |

Note that the alphabetical order at every depth forces a different result for Spanish:
[Edited to fix the error identified by Erwin]

id |parent_id|tree_depth|id|category_id|lang_code|title       |
---|---------|----------|--|-----------|---------|------------|
  2|         |         0|14|          2|ES       |Dos         |
 22|        2|         1|20|         22|ES       |Dos.Dos     |
 21|        2|         1|19|         21|ES       |Dos.Uno     |
212|       21|         2|23|        212|ES       |Dos.Uno.Dos |
213|       21|         2|24|        213|ES       |Dos.Uno.Tres|
211|       21|         2|22|        211|ES       |Dos.Uno.Uno |
  1|         |         0|13|          1|ES       |Uno         |
 12|        1|         1|17|         12|ES       |Uno.Dos     |
 13|        1|         1|18|         13|ES       |Uno.Tres    |
 11|        1|         1|16|         11|ES       |Uno.Uno     |
  3|         |         0|15|          3|ES       |Tres        |
 31|        3|         1|21|         31|ES       |Tres.Uno    |

I have tried a number of approaches, including a recursive CTE as in https://www.postgresql.org/docs/12/queries-with.html, but none seems to cope with the problem of different orders for different languages.

Any ideas?


Solution

  • ... the expected order should be compliant with tree hierarchy and alphabetical order in English (at every depth level),

    The additional difficulty is that category_lang(title, lang_code) is not defined UNIQUE, so we need to sort by title and category_id (as tiebreaker) on every level - which is hard to implement for a dynamic number of levels. An array of composite type can solve the conundrum.

    Your displayed results do not currently comply with your requirement. 'Three' should sort before 'Two' according to English sorting rules. The result of the following query implements your requirement:

    Create once per database:

    CREATE TYPE title_id AS (title varchar(128), id int);
    

    Then use a recursive CTE to generate an array of this composite type according to its path.

    WITH RECURSIVE tree AS (
       SELECT c.id AS cat_id, c.parent_id, c.tree_depth
            , l.id AS lang_id, l.title, l.lang_code
            , ARRAY[(l.title, l.category_id)::title_id] AS sort_arr
       FROM   category      c 
       JOIN   category_lang l ON l.category_id = c.id
                             AND l.lang_code = 'EN'
       WHERE  c.parent_id IS NULL  -- root cat
    
       UNION ALL
       SELECT c.id AS cat_id, c.parent_id, c.tree_depth
            , l.id AS lang_id, l.title, l.lang_code
            , sort_arr || (l.title, l.category_id)::title_id
       FROM   tree          t
       JOIN   category      c ON c.parent_id = t.cat_id
       JOIN   category_lang l ON l.category_id = c.id
                             AND l.lang_code = t.lang_code
        )
    SELECT cat_id, parent_id, tree_depth, lang_id, title 
    FROM   tree
    ORDER  BY sort_arr;
    

    db<>fiddle here

    Closely related with more explanation and details:

    COLLATE?

    But that's not all. The simple solution sorts by the default collation of your database, which seems inappropriate for different languages.

    Every language has its own collation rules, or typically several of them, depending on the region of the world and other political / cultural rules. The "language" is not enough to specify exact rules for sorting. The precise locale matters. Postgres implements collation-aware sorting with the COLLATE keyword. You would have to store the actual precise collation in addition to the language, and use it to sort properly.

    Also, indexes depend on the exact COLLATION. You might consider multiple partial indexes with different collation. Lots of tricky stuff that goes beyond the scope of this question. See:

    Asides