I have a large hierarchy (2,500+ records) stored in Microsoft SQL Server (2019) using an adjacency list model (e.g., Id
, ParentId
). I am looking for an efficient approach to look up a record based on a particular path in the hierarchy. In other words, given a path (e.g. /Root/FolderA/SubfolderA
), I'd like to retrieve the Id
associated with the final node (i.e., SubfolderA
in this case).
Note: Node names are not globally unique. I.e., we can't just look for
SubfolderA
and assume it maps to/Root/FolderA/SubfolderA
. There may be multiple nodes namedSubfolderA
within the hierarchy.
/Root
/FolderA
/SubfolderA
/SubfolderB
/FolderB
/SubfolderA
/SubfolderB
CREATE
TABLE [dbo].[Tree] (
[Id] INT NOT NULL PRIMARY KEY,
[ParentId] INT NULL,
[Name] VARCHAR(255) NOT NULL,
CONSTRAINT [FK_Hierarchy]
FOREIGN KEY (ParentId)
REFERENCES [Tree]([Id])
)
INSERT INTO Tree VALUES (1, NULL, 'Root');
INSERT INTO Tree VALUES (2, 1, 'FolderA');
INSERT INTO Tree VALUES (3, 2, 'SubfolderA');
INSERT INTO Tree VALUES (4, 2, 'SubfolderB');
INSERT INTO Tree VALUES (5, 1, 'FolderB');
INSERT INTO Tree VALUES (6, 5, 'SubfolderA');
INSERT INTO Tree VALUES (7, 5, 'SubfolderB');
There are quite a few threads on how to convert an adjacency list to materialized paths, including:
We can use one of these approaches to convert the entire adjacency list into materialized paths using an rCTE:
CREATE
VIEW [dbo].[MaterializedPaths]
WITH SCHEMABINDING
AS
WITH RCTE AS (
SELECT Id,
ParentId,
CAST('/' + Name AS VARCHAR(255)) AS Path
FROM [dbo].[Tree] root
WHERE root.Id = 1
UNION ALL
SELECT this.Id,
this.ParentId,
CAST(parent.Path + '/' + this.Name AS VARCHAR(255)) AS Path
FROM [dbo].[Tree] AS this
INNER JOIN RCTE parent
ON this.ParentId = parent.Id
)
SELECT Id,
Path
FROM RCTE as hierarchy
This produces the following output:
Id Path
1 /Root
2 /Root/FolderA
3 /Root/FolderA/SubfolderA
4 /Root/FolderA/SubfolderB
5 /Root/FolderB
6 /Root/FolderB/SubfolderA
7 /Root/FolderB/SubfolderB
We can filter that output using a simple WHERE
clause:
SELECT Id
FROM MaterializedPaths
WHERE Path = '/Root/FolderA/SubfolderA'
The naïve approach works fine. The problem is that it's incredibly inefficient—and, thus, slow—for querying large hierarchies, as it needs to dynamically reconstruct the entire set of materialized paths each call. In my case, that takes 8-9 seconds. Obviously, I could just store this data in a table and regenerate it via a trigger at any time the data changes. But I'd rather find a more efficient query and avoid the additional complexity.
What is an efficient way of constructing this query? Or, at the risk of making this an XY problem, is there a way to limit the rCTE so that it only needs to evaluate the nodes in the hierarchy, instead of reconstructing the entire hierarchy each time?
Is there a way to limit the rCTE so that it only needs to evaluate the nodes in the hierarchy, instead of reconstructing the entire hierarchy each time?
There are a handful of approaches to limiting the scope of each recursive query so that it's only evaluating the relevant nodes in the hierarchy. A fairly efficient approach is to simply restrict the rCTE to records which the source path (let's call that @Path
) starts with:
INNER JOIN RCTE recursive
ON this.ParentId = recursive.Id
AND @Path LIKE CAST(recursive.Path + '/' + this.Name AS VARCHAR(MAX)) + '%'
This will limit the query to each record in your path:
Id Path
1 /Root
2 /Root/FolderA
3 /Root/FolderA/SubfolderA
Which can then be easily filtered down to the final record based on a simple WHERE
clause:
WHERE Path = @Path
We can combine this with the original rCTE into a function. Putting it all together, it might look like:
CREATE
FUNCTION [dbo].[GetIdFromPath]
(
@Path VARCHAR(MAX)
)
RETURNS INT
AS
BEGIN
DECLARE @Id INT = -1
;WITH RCTE AS (
SELECT Id,
ParentId,
CAST('/' + Name AS VARCHAR(MAX)) AS Path
FROM [dbo].[Tree] root
WHERE root.Id = 1
UNION ALL
SELECT this.Id,
this.ParentId,
CAST(parent.Path + '/' + this.Name AS VARCHAR(MAX)) AS Path
FROM [dbo].[Tree] AS this
INNER JOIN RCTE parent
ON Tree.ParentId = parent.Id
AND @Path LIKE CAST(parent.Path + '/' + this.Name AS VARCHAR(MAX)) + '%'
)
SELECT @Id = Id
FROM RCTE as hierarchy
WHERE Path = @Path
RETURN @Id
END
Given the above function, you can now query the adjacency list by simply passing the full path to the GetIdFromPath()
function:
SELECT dbo.GetIdFromPath('/Root/FolderA/SubfolderA') AS Id
Which, given the sample data from the original post, will return 3
.
I've tested this approach against a table of comparable size, with 2,500 sample records, and it consistently executes well under a second, which is a dramatic improvement over the naïve approach. Obviously, you'll need to evaluate this against your own database and performance requirements to determine if its efficient enough.