sqlpostgresqlrecursiontreestructure

Select/filter tree structure in postgres


I am working one a new feature for a legacy system. This system uses two tables to keep 'documents' and to keep the relationship among documents ('relation'). This 'relation' table creates something like a tree structure.

We need to list all documents that are classified as valid types and that do not have any other documents of valid types as their ancestors.

Document (~500 million records in production)


id:   |  type
1   invalid_type
2   valid_type_a
3   invalid_type
4   valid_type_a
5   invalid_type
6   valid_type_b
7   invalid_type
8   valid_type_b
9   invalid_type
10  valid_type_a
11  valid_type_a
12  invalid_type
13  invalid_type
14  invalid_type
15  valid_type_b

Relation (~500 million records in production)


relationId | parentDocumentId | childDocumentId
1       1       2
2       1       3
3       2       4
4       2       5
5       3       6
6       6       7
7       8       9
8       9       10
9       12      13
10      13      14
11      13      15

structure

Out of these tables, I need to list all documents of valid types and that do not have an ancestor document of any valid type (in any level).

The expected result is: 2, 6, 8, 11, 15

4 is a valid document, but has 2 as its parent.

10 is a valid document, but has 8 as its ancestor.


Although I could eventually change this structure, maybe even migrate it to another database (nosql...), the focus now is to develop a new feature using the current model.

I have been playing with recursive queries, but could not get all things together yet. I could also select and filter the records to some extent and then apply the remaining rules in my code.

Any help or tips pointing to any direction are much appreciated.


CREATE TABLE IF NOT EXISTS document
(
    id int not null,
    type varchar not null,
    CONSTRAINT document_pk PRIMARY KEY (id)
);

CREATE TABLE IF NOT EXISTS relation
(
    id int not null,
    parent_id int not null,
    child_id int not null,
    CONSTRAINT relation_pk PRIMARY KEY (id),
    CONSTRAINT parent_fk FOREIGN KEY (parent_id)
        REFERENCES document (id),
    CONSTRAINT child_fk FOREIGN KEY (child_id)
        REFERENCES document (id)
);


INSERT INTO document (id, type)
    VALUES 
        (1, 'invalid_type'), 
        (2, 'valid_type_a'), 
        (3, 'invalid_type'), 
        (4, 'valid_type_a'), 
        (5, 'invalid_type'), 
        (6, 'valid_type_b'), 
        (7, 'invalid_type'), 
        (8, 'valid_type_b'), 
        (9, 'invalid_type'), 
        (10, 'valid_type_a'), 
        (11, 'valid_type_a'), 
        (12, 'invalid_type'), 
        (13, 'invalid_type'), 
        (14, 'invalid_type'), 
        (15, 'valid_type_b');


INSERT INTO relation (id, parent_id, child_id)
    VALUES 
        (1, 1, 2), 
        (2, 1, 3),
        (3, 2, 4),
        (4, 3, 5),
        (5, 3, 6),
        (6, 6, 7),
        (7, 8, 9),
        (8, 9, 10),
        (9, 12, 13),
        (10, 13, 14),
                (11, 13, 15);

Solution

  • From each candidate node you can walk the tree upward and find if there's a valid ancestor. For example:

    with recursive
    n (id, lvl, cid, burned) as (
      select id, 1, id, false from document where type like 'v%'
     union all
      select n.id, n.lvl + 1, d.id, d.type like 'v%'
      from n
      join relation r on r.childDocumentId = n.cid and not n.burned
      join document d on d.id = r.parentDocumentId
    )
    select *
    from (
      select distinct on (id) * from n order by id, lvl desc
    ) x
    where not burned;
    

    Result:

     id  lvl  cid  burned 
     --- ---- ---- ------ 
     2   2    1    f      
     6   3    1    f      
     8   1    8    f      
     11  1    11   f      
     15  3    12   f      
    

    See running example at db<>fiddle.

    Note: It assumes there are no circular references.

    The following indexes can help with performance:

    create index ix1 on relation (childDocumentId, parentDocumentId);
    
    create index ix2 on document (id);