postgresqlhierarchyltree

Validating LTREE hierarchies in PostgreSQL


I'm new to hierarchies in general and LTREE in particular. As I've been converting and loading a column of text-based hierarchies into an LTREE column, I noticed a poorly-formatted string.

create table test_tree(id int, path ltree);
insert into test_tree values (1, '1');
insert into test_tree values (1, '1.1');
insert into test_tree values (1, '1.2.0'); --should be '1.2'
insert into test_tree values (1, '1.2.1');
insert into test_tree values (1, '1.2.2.0'); --should be '1.2.2'
insert into test_tree values (1, '1.2.2.1');
insert into test_tree values (1, '1.2.2.2');

This results in some unexpected behavior.

select path from test_tree where path <@ '1';

returns descendants, i.e.:

1
1.1
1.2.0
1.2.1
1.2.2.0
1.2.2.1
1.2.2.2

Whereas:

select path from test_tree where path @> '1.2.2.2';

only returns

1.2.2.2

I would expect <@ '1' to returns results consistent with @> '1.2.2.2'. In this instance how can an ancestor know its descendants, but a descendant not know its ancestors? Why does <@ '1' return all offspring (seemingly ignoring the missing '1.2.2') but @> '1.2.2.2' return no ancestors?

Moreover, how can I find these missing relationships in LTREE datatypes?


Solution

  • The ltree operators do not care what values you have in your table or not. They only compare two ltree values. '1' @> '1.2.2.2' is true, '1.2.2.2' @> '1.2.2.2' is true, '1.1' @> '1.2.2.2' is not.

    But a SELECT query returns only the rows that actually exist in your table. '1.2' @> '1.2.2.2' and '1.2.2' @> '1.2.2.2' would have been true as well, however those two values do not exist in your table, so they cannot be found. The @>/<@ operators do not construct new rows.

    To actually construct all possible ancestors of an ltree value, not just those that are part of your table, you can use

    SELECT subpath(p, 0, generate_series(1, nlevel(p)))
    

    (online demo)

    You also seem to have assumed an implicit constraint from using ltree columns that the parent value exists in the same column of the table. This is not possible, in relational databases rows are independent from each other: an ltree value is not a reference to another row plus the last label, it really is just a list of labels; every row in your table stores the complete label path. Using a specific type for a column cannot introduce a constraint, you'd have to do that yourself - either as a complicated foreign key from a generated column, or using a trigger.

    How can I find these missing relationships in LTREE datatypes?

    You can find such missing rows in your table (not in the datatype) using

    SELECT path, array_agg(id) AS required_by
    FROM (
      SELECT id, subpath(path, 0, generate_series(1, nlevel(path)-1)) AS path
      FROM test_tree
      ORDER BY path, id
    ) AS all_parent_paths
    WHERE NOT EXISTS (SELECT * FROM test_tree WHERE path = all_parent_paths.path)
    GROUP BY path
    

    (online demo)