sqlsql-serverrecursiongraphcommon-table-expression

Prevent recursive CTE visiting nodes multiple times


Consider the following simple DAG:

  1->2->3->4

And a table, #bar, describing this (I'm using SQL Server 2005):

parent_id   child_id
1           2
2           3
3           4
//... other edges, not connected to the subgraph above

Now imagine that I have some other arbitrary criteria that select the first and last edges, i.e. 1->2 and 3->4. I want to use these to find the rest of my graph.

I can write a recursive CTE as follows (I'm using terminology from MSDN):

with foo(parent_id,child_id) as (
// anchor member that happens to select first and last edges:
select parent_id,child_id from #bar where parent_id in (1,3)
union all
// recursive member:
select #bar.* from #bar
join foo on #bar.parent_id = foo.child_id
)
select parent_id,child_id from foo

However, this results in edge 3->4 being selected twice:

parent_id  child_id
1          2
3          4
2          3
3          4    // 2nd appearance!

How can I prevent the query from recursing into subgraphs that have already been described? I could achieve this if, in my "recursive member" part of the query, I could reference all data that has been retrieved by the recursive CTE so far (and supply a predicate indicating in the recursive member excluding nodes already visited). However, I think I can access data that was returned by the last iteration of the recursive member only.

This will not scale well when there is a lot of such repetition. Is there a way of preventing this unnecessary additional recursion?

Note that I could use "select distinct" in the last line of my statement to achieve the desired results, but this seems to be applied after all the (repeated) recursion is done, so I don't think this is an ideal solution.

Edit - hainstech suggests stopping recursion by adding a predicate to exclude recursing down paths that were explicitly in the starting set, i.e. recurse only where foo.child_id not in (1,3). That works for the case above only because it simple - all the repeated sections begin within the anchor set of nodes. It doesn't solve the general case where they may not be. e.g., consider adding edges 1->4 and 4->5 to the above set. Edge 4->5 will be captured twice, even with the suggested predicate. :(


Solution

  • The CTE's are recursive.

    When your CTE's have multiple initial conditions, that means they also have different recursion stacks, and there is no way to use information from one stack in another stack.

    In your example, the recursion stacks will go as follows:

    (1) - first IN condition
    (1, 2)
    (1, 2, 3)
    (1, 2, 3, 4)
    (1, 2, 3) - no more children
    (1, 2) - no more children
    (1) - no more children, going to second IN condition
    
    (3) - second condition
    (3, 4)
    (3) - no more children, returning
    

    As you can see, these recursion stack do not intersect.

    You could probably record the visited values in a temporary table, JOIN each value with the temptable and do not follow this value it if it's found, but SQL Server does not support these things.

    So you just use SELECT DISTINCT.