sql-servergraph-theory

Graph traversal: find final destination


I have a table in SQL Server that represents nodes directions. Each node can either go into another node or stay as it is. I need to resolve the final node to which each node should go.

Here is an example:

CREATE TABLE dbo.NodeDirections 
(
    [Start] VARCHAR(10),
    [End] VARCHAR(10)
);

INSERT INTO dbo.NodeDirections ([Start], [End]) 
VALUES ('A1', NULL), ('A2', 'A1'), ('A3', 'A2'), ('A5', NULL);

In this example:

I want to create a query that resolves the final node to which each node should be god.

The expected result should be:

Start   End
-----------
  A1    A1
  A2    A1
  A3    A1
  A4    A1
  A5    A5

How can I achieve this in SQL Server?


Solution

  • As your graph can have any level of depth, you have no other choice than using recursive CTE.

    Theory

    A CTE (Common Table Expression), also called "WITH expression", is an ad-hoc table-like structure that you can refer (as any "real" table of your database) from a SELECT; the CTE lasts only for the duration of this main SELECT, so you can consider it as a really temporary table.

    WITH
      cte0 AS (SELECT a+b AS r FROM realtable),
      cte1 AS (SELECT b, r + 1 AS rinc FROM cte0 JOIN realtable r ON r.a = cte0.r) -- You can chain CTEs, join them to real tables, and so on.
    SELECT * FROM cte1; -- And the main query can use them.
    

    A recursive CTE is a special kind of CTE, where one CTE can SELECT from itself… well, in fact, it can SELECT from the previous iteration of itself.

    A quick example would be emitting all numbers from 0 to 10:

    WITH r AS
    (
      SELECT 0 AS i -- ← Initialization, which does only depend on other tables (in fact on no table in our case)
      UNION ALL -- ← Separates the initialization (iteration 0) to the recursive part below.
      SELECT i + 1 FROM r -- ← Add 1 to the i of the previous iteration of r, to get the value of i for this iteration
      WHERE i < 10 -- ← <, not <=, else this would emit a last row containing 11.
    )
    SELECT * FROM r;
    

    On to your problem

    In the theorical case above, we just wanted to list down every possible number, but did not bother having a memory of the path to reach that number.

    In your case (and in general in any graph walkthrough), you will want some columns just hold data fetched at this iteration, some columns store the state at this iteration (next parent to fetch), and some columns related to the full walkthrough (what was the starting point? How many iterations did I go through?).

    So at iteration 0 we will probably have to duplicate one column, because its use will diverge at later iterations.
    For example your Start column in a bottom-to-top walkthrough will be both the starting point ("of full walkthrough interest") and the root value to keep in case we have to stop immediately, with no End ("info fetched at this iteration").

    Bottom to top

    To be sure not to forget any node, you can have your iteration 0 list all nodes, and the recursively go up:

    WITH
      r AS
      (
        SELECT [Start] id, [Start] cur, [End] parent FROM NodeDirections
        UNION ALL
        SELECT
          r.id,              -- Stable part of your recursion: remember who we're working for.
          p.[Start], p.[End] -- Evolving part of the recusion: add a new row where (for the same id) we now have the contents of one level upper, compared to the previous iteration.
        FROM r JOIN NodeDirections p ON p.[Start] = r.parent
        -- No need of a WHERE: as your graph makes sure there is no loop, the recursion will stop when reaching a NULL parent (which joins to nothing in NodeDirections)
      )
    SELECT id AS [Start], cur AS [End] FROM r
    -- Remember we said that once outside of r, we would see every iteration ever stored in the internal buffer?
    -- so to get a clean result, we now only want to select the last reached state, that is, when the recursion could not emit rows anymore: when [End] (last seen in r.parent) was NULL.
    WHERE parent IS NULL;
    

    Returns:

    Start End
    A1 A1
    A5 A5
    A4 A1
    A3 A1
    A2 A1

    or, with a diagnosing column giving the iteration number to see how iterations gradually dried up:

    id cur parent iteration
    A1 A1 null 0
    A2 A2 A1 0
    A3 A3 A2 0
    A4 A4 A3 0
    A5 A5 null 0
    A2 A1 null 1
    A3 A2 A1 1
    A4 A3 A2 1
    A3 A1 null 2
    A4 A2 A1 2
    A4 A1 null 3

    You can see by yourself in a fiddle running on your example.

    Infinite loops

    Although SQL Server provides a default protection against infinite loops, you should always ensure that your data forbids infinite loop, else prevent it from your recursive part.

    Imagine 1 having an [End] of 3, you would have 1 ← 2 ← 3 ← 1 ← 2 …).

    To prevent that, a simple watchdog in your recursive part could be as simple as WHERE id <> [End]:
    when, while fetching a grand-grand-grand-…parent for 1 we find 1 itself, we have a problem. The WHERE would discard this would-be trouble maker.

    Bottom to top

    But in your case, as you have no loops, you can be sure that nodes having no parent are the only heads of the graph, by which you'll be able to reach every node.

    So, starting from it, we could build a lighter tree; while in a bottom to top traversal we passed multiple times into 2 (for itself on iteration 0, then as a parent of 3, then grand-parent of 4), with a top to bottom traversal we will reach each node exactly once.

    WITH
      r AS
      (
        SELECT [Start], [Start] AS [End] FROM NodeDirections WHERE [End] IS NULL
        UNION ALL
        SELECT c.[Start], r.[End]
        FROM r JOIN NodeDirections c ON c.[End] = r.[Start]
      )
    SELECT * FROM r; -- No need to filter: the internal buffer stored exactly one row, the one we were interested in, for each node!
    

    And you'll see it in a augmented fiddle.