sqlpostgresqlpieclouddb

How to get the complete path with CTE


I have a table like below:

ID NextID
1 5
2 NULL
3 6
4 7
5 8
6 9
7 NULL
8 NULL
9 10
10 NULL

I want to get the ID path:

1 --> 5 --> 8
2
3 --> 6 --> 9 --> 10
4 --> 7

and I tried this:

WITH RECURSIVE path_cte AS (
  SELECT ID, NextID, ID::TEXT AS Path
  FROM t1
  WHERE NextID IS NULL
  UNION ALL
  SELECT t1.ID, t1.NextID, t1.ID || ' --> ' || cte.Path
  FROM t1
  JOIN path_cte cte ON t1.NextID = cte.ID
)
SELECT Path
FROM path_cte
ORDER BY ID;

but I got the output:

1 --> 5 --> 8
2
3 --> 6 --> 9 --> 10
4 --> 7
5 --> 8
6 --> 9 --> 10
7
8
9 --> 10
10

I don't want to get those incomplete paths, but I don't know how to achieve that.

The path is based on ID and NextID, if NextID is NULL, the path ends, and this ID is the end of the path, the next is to trace forward to get a complete path.

If an ID has neither a preceding ID nor a subsequent ID, it is also considered a path.

The database I use is compatible with PostgreSQL syntax, so you can also use PostgreSQL to demo, thanks :)


Solution

  • If you can afford a versatile, battle-tested, highly optimised (Boost C++ level of optimised) extension which pgrouting is, what I'd also consider is not trying to solve a long-solved problem. Your data set can be interpreted as a graph, and that's what this extension does: graphs.

    The problem you're dealing with two calls. One pgr_connectedComponents():

    A connected component of an undirected graph is a set of vertices that are all reachable from each other.

    alter table t1 add column subgraph integer;
    
    explain analyze verbose
    update t1 set subgraph=c.component
    from pgr_connectedComponents(
      $x$ select row_number()over() as id
               , coalesce(id,nextid,1e7::int) as source
               , coalesce(nextid,id,1e7::int) as target
               , 1      as cost
               , 1      as reverse_cost
          from t1 
      $x$) as c
    where c.node=t1.id;--4087.475 ms
    
    create index on t1(subgraph,id,nextid)with(fillfactor=100);
    

    That establishes individual subgraphs/trees/chains/disjoined groups of nodes. Once you have that, you can run any routing or spanning tree algorithm you want. E.g. pgr_dijkstra():

    select string_agg(node::text,' --> ' order by path_seq) as path
    from(select subgraph
              , id as bottom
              , (select id from t1 as t2 
                 where t1.subgraph=t2.subgraph
                 and t2.nextid is null) as top
         from t1 
         where id is not null 
           and not exists(select from t1 as t2 
                          where t2.nextid=t1.id)) as top_bottom_per_subgraph
    cross join lateral pgr_dijkstra(
          format($x$ select row_number()over() as id
                       , coalesce(id,nextid,2e9::int) as source
                       , coalesce(nextid,id,2e9::int) as target
                       , 1 as cost
                       , 1 as reverse_cost
                  from t1
                  where id is not null
                    and subgraph=%L 
                 $x$, top_bottom_per_subgraph.subgraph)
        ,top_bottom_per_subgraph.bottom
        ,top_bottom_per_subgraph.top)
    group by subgraph;--4058.037 ms
    

    It does a single run between the top and bottom nodes in each subgraph. In each case you tell pgrouting what you want and leave worrying how to do that efficiently to decades of C++ freaks.

    On 1'000'000 randomised rows with the longest chain/largest group of 100k, the first part takes 4087ms, the second 4058ms.

    db<>fiddle, db-fiddle and sqlfiddle don't offer pgrouting, which is why I couldn't attach an online demo but when I run the answers posted here so far on the same env, these are the results:

    variant score
    pgr_topologicalSort on a DAG 3'546ms
    Dijkstra on subgraphs (this) 8'145ms
    ValNik 44'912ms
    hobgoblin 182'969 ms
    Patrick Chin 210'174 ms
    Jonas Metzler timed out after 12h