sqlsql-serverrecursive-queryadjacency-listmaterialized-path-pattern

Query an adjacency list in SQL via a materialized path


I have a large hierarchy (2,500+ records) stored in Microsoft SQL Server (2019) using an adjacency list model (e.g., Id, ParentId). I am looking for an efficient approach to look up a record based on a particular path in the hierarchy. In other words, given a path (e.g. /Root/FolderA/SubfolderA), I'd like to retrieve the Id associated with the final node (i.e., SubfolderA in this case).

Note: Node names are not globally unique. I.e., we can't just look for SubfolderA and assume it maps to /Root/FolderA/SubfolderA. There may be multiple nodes named SubfolderA within the hierarchy.

Setup

Hierarchy

/Root
  /FolderA
    /SubfolderA
    /SubfolderB
  /FolderB
    /SubfolderA
    /SubfolderB

Structure

CREATE 
TABLE   [dbo].[Tree] (
        [Id]            INT             NOT NULL PRIMARY KEY, 
        [ParentId]      INT             NULL, 
        [Name]          VARCHAR(255)    NOT NULL, 
        CONSTRAINT      [FK_Hierarchy]  
        FOREIGN KEY     (ParentId) 
        REFERENCES      [Tree]([Id])
)

Data

INSERT INTO Tree VALUES (1,    NULL, 'Root');
INSERT INTO Tree VALUES (2,    1,    'FolderA');
INSERT INTO Tree VALUES (3,    2,    'SubfolderA');
INSERT INTO Tree VALUES (4,    2,    'SubfolderB');
INSERT INTO Tree VALUES (5,    1,    'FolderB');
INSERT INTO Tree VALUES (6,    5,    'SubfolderA');
INSERT INTO Tree VALUES (7,    5,    'SubfolderB');

Naïve Approach

There are quite a few threads on how to convert an adjacency list to materialized paths, including:

View

We can use one of these approaches to convert the entire adjacency list into materialized paths using an rCTE:

CREATE 
VIEW            [dbo].[MaterializedPaths]
WITH            SCHEMABINDING
AS 

WITH RCTE AS (

  SELECT        Id,
                ParentId,
                CAST('/' + Name AS VARCHAR(255)) AS Path
  FROM          [dbo].[Tree] root
  WHERE         root.Id = 1

  UNION ALL

  SELECT        this.Id,
                this.ParentId,
                CAST(parent.Path + '/' + this.Name AS VARCHAR(255)) AS Path
  FROM          [dbo].[Tree] AS this
  INNER JOIN    RCTE parent
    ON          this.ParentId = parent.Id
)
SELECT          Id,
                Path
FROM            RCTE as hierarchy

Output

This produces the following output:

Id    Path
1     /Root
2     /Root/FolderA
3     /Root/FolderA/SubfolderA
4     /Root/FolderA/SubfolderB
5     /Root/FolderB
6     /Root/FolderB/SubfolderA
7     /Root/FolderB/SubfolderB

Query

We can filter that output using a simple WHERE clause:

SELECT          Id
FROM            MaterializedPaths
WHERE           Path = '/Root/FolderA/SubfolderA'

Problem

The naïve approach works fine. The problem is that it's incredibly inefficient—and, thus, slow—for querying large hierarchies, as it needs to dynamically reconstruct the entire set of materialized paths each call. In my case, that takes 8-9 seconds. Obviously, I could just store this data in a table and regenerate it via a trigger at any time the data changes. But I'd rather find a more efficient query and avoid the additional complexity.

Question

What is an efficient way of constructing this query? Or, at the risk of making this an XY problem, is there a way to limit the rCTE so that it only needs to evaluate the nodes in the hierarchy, instead of reconstructing the entire hierarchy each time?


Solution

  • Is there a way to limit the rCTE so that it only needs to evaluate the nodes in the hierarchy, instead of reconstructing the entire hierarchy each time?

    Limiting the rCTE

    There are a handful of approaches to limiting the scope of each recursive query so that it's only evaluating the relevant nodes in the hierarchy. A fairly efficient approach is to simply restrict the rCTE to records which the source path (let's call that @Path) starts with:

    INNER JOIN  RCTE recursive
      ON        this.ParentId = recursive.Id
      AND       @Path LIKE CAST(recursive.Path + '/' + this.Name AS VARCHAR(MAX)) + '%'
    

    This will limit the query to each record in your path:

    Id    Path
    1     /Root
    2     /Root/FolderA
    3     /Root/FolderA/SubfolderA
    

    Which can then be easily filtered down to the final record based on a simple WHERE clause:

    WHERE Path = @Path
    

    Packaging it as a function

    We can combine this with the original rCTE into a function. Putting it all together, it might look like:

    CREATE
    FUNCTION        [dbo].[GetIdFromPath]
    (
        @Path       VARCHAR(MAX)
    )
    RETURNS         INT
    AS
    
    BEGIN
    
      DECLARE       @Id         INT = -1
    
    
      ;WITH RCTE AS (
    
        SELECT      Id,
                    ParentId,
                    CAST('/' + Name AS VARCHAR(MAX)) AS Path
        FROM        [dbo].[Tree] root
        WHERE       root.Id = 1
    
        UNION ALL
    
        SELECT      this.Id,
                    this.ParentId,
                    CAST(parent.Path + '/' + this.Name AS VARCHAR(MAX)) AS Path
        FROM        [dbo].[Tree] AS this
        INNER JOIN  RCTE parent
          ON        Tree.ParentId = parent.Id
          AND       @Path LIKE CAST(parent.Path + '/' + this.Name AS VARCHAR(MAX)) + '%'
      )
      SELECT        @Id = Id
      FROM          RCTE as hierarchy
      WHERE         Path = @Path
    
      RETURN        @Id
    
    END
    

    Querying by path

    Given the above function, you can now query the adjacency list by simply passing the full path to the GetIdFromPath() function:

    SELECT          dbo.GetIdFromPath('/Root/FolderA/SubfolderA') AS Id
    

    Which, given the sample data from the original post, will return 3.

    Performance

    I've tested this approach against a table of comparable size, with 2,500 sample records, and it consistently executes well under a second, which is a dramatic improvement over the naïve approach. Obviously, you'll need to evaluate this against your own database and performance requirements to determine if its efficient enough.