sqlsql-serversql-server-2008

How to find all connected subgraphs of an undirected graph


I need some help for a problem that i am struggling to solve.

Example table:

ID |Identifier1 | Identifier2 
---------------------------------
1  |      a     | c         
2  |      b     | f         
3  |      a     | g         
4  |      c     | h        
5  |      b     | j         
6  |      d     | f         
7  |      e     | k  
8  |      i     |          
9  |      l     | h    

I want to group identifiers that are related with each other between two columns and assign a unique group id.

Desired Output:

Identifier | Gr_ID    |    Gr.Members                 
---------------------------------------------------
a       |      1      |   (a,c,g,h,l)  
b       |      2      |   (b,d,f,j)       
c       |      1      |   (a,c,g,h,l)  
d       |      2      |   (b,d,f,j)       
e       |      3      |   (e,k)                 
f       |      2      |   (b,d,f,j)       
g       |      1      |   (a,c,g,h,l)  
h       |      1      |   (a,c,g,h,l)  
j       |      2      |   (b,d,f,j)       
k       |      3      |   (e,k)                 
l       |      1      |   (a,c,g,h,l)  
i       |      4      |   (i)  

Note:the column Gr.Members is not necessary, mostly is used for a clearer view.

So the definition for a group is: A row belongs to a group if it shares at least one identifier with at least one row of this group

But the group id has to be assigned to each identifier(selected by the union of the two columns) not to the row.

Any help on how to build a query to give the desired output?

Thank you.


Update: Below are some extra sample sets with their expected output.


Given table:

Identifier1 | Identifier2   
----------------------------
    a       |   f
    a       |   g
    a       |  NULL
    b       |   c
    b       |   a
    b       |   h
    b       |   j
    b       |  NULL
    b       |  NULL
    b       |   g
    c       |   k
    c       |   b
    d       |   l
    d       |   f
    d       |   g
    d       |   m
    d       |   a
    d       |  NULL
    d       |   a
    e       |   c
    e       |   b
    e       |  NULL

Expected output: all the records should belong to the same group with group ID = 1.


Given Table:

Identifier1 | Identifier2
--------------------------
a           |   a
b           |   b
c           |   a
c           |   b
c           |   c

Expected output: The records should be in the same group with group ID = 1.


Solution

  • Here is a variant that doesn't use cursor, but uses a single recursive query.

    Essentially, it treats the data as edges in a graph and traverses recursively all edges of the graph, stopping when the loop is detected. Then it puts all found loops in groups and gives each group a number.

    See the detailed explanations of how it works below. I recommend you to run the query CTE-by-CTE and examine each intermediate result to understand what it does.

    Sample 1

    DECLARE @T TABLE (ID int, Ident1 char(1), Ident2 char(1));
    INSERT INTO @T (ID, Ident1, Ident2) VALUES
    (1, 'a', 'a'),
    (2, 'b', 'b'),
    (3, 'c', 'a'),
    (4, 'c', 'b'),
    (5, 'c', 'c');
    

    Sample 2

    I added one more row with z value to have multiple rows with unpaired values.

    DECLARE @T TABLE (ID int, Ident1 char(1), Ident2 char(1));
    INSERT INTO @T (ID, Ident1, Ident2) VALUES
    (1, 'a', 'a'),
    (1, 'a', 'c'),
    (2, 'b', 'f'),
    (3, 'a', 'g'),
    (4, 'c', 'h'),
    (5, 'b', 'j'),
    (6, 'd', 'f'),
    (7, 'e', 'k'),
    (8, 'i', NULL),
    (88, 'z', 'z'),
    (9, 'l', 'h');
    

    Sample 3

    DECLARE @T TABLE (ID int, Ident1 char(1), Ident2 char(1));
    INSERT INTO @T (ID, Ident1, Ident2) VALUES
    (1, 'a', 'f'),
    (2, 'a', 'g'),
    (3, 'a', NULL),
    (4, 'b', 'c'),
    (5, 'b', 'a'),
    (6, 'b', 'h'),
    (7, 'b', 'j'),
    (8, 'b', NULL),
    (9, 'b', NULL),
    (10, 'b', 'g'),
    (11, 'c', 'k'),
    (12, 'c', 'b'),
    (13, 'd', 'l'),
    (14, 'd', 'f'),
    (15, 'd', 'g'),
    (16, 'd', 'm'),
    (17, 'd', 'a'),
    (18, 'd', NULL),
    (19, 'd', 'a'),
    (20, 'e', 'c'),
    (21, 'e', 'b'),
    (22, 'e', NULL);
    

    Query

    WITH
    CTE_Idents
    AS
    (
        SELECT Ident1 AS Ident
        FROM @T
    
        UNION
    
        SELECT Ident2 AS Ident
        FROM @T
    )
    ,CTE_Pairs
    AS
    (
        SELECT Ident1, Ident2
        FROM @T
        WHERE Ident1 <> Ident2
    
        UNION
    
        SELECT Ident2 AS Ident1, Ident1 AS Ident2
        FROM @T
        WHERE Ident1 <> Ident2
    )
    ,CTE_Recursive
    AS
    (
        SELECT
            CAST(CTE_Idents.Ident AS varchar(8000)) AS AnchorIdent 
            , Ident1
            , Ident2
            , CAST(',' + Ident1 + ',' + Ident2 + ',' AS varchar(8000)) AS IdentPath
            , 1 AS Lvl
        FROM 
            CTE_Pairs
            INNER JOIN CTE_Idents ON CTE_Idents.Ident = CTE_Pairs.Ident1
    
        UNION ALL
    
        SELECT 
            CTE_Recursive.AnchorIdent 
            , CTE_Pairs.Ident1
            , CTE_Pairs.Ident2
            , CAST(CTE_Recursive.IdentPath + CTE_Pairs.Ident2 + ',' AS varchar(8000)) AS IdentPath
            , CTE_Recursive.Lvl + 1 AS Lvl
        FROM
            CTE_Pairs
            INNER JOIN CTE_Recursive ON CTE_Recursive.Ident2 = CTE_Pairs.Ident1
        WHERE
            CTE_Recursive.IdentPath NOT LIKE CAST('%,' + CTE_Pairs.Ident2 + ',%' AS varchar(8000))
    )
    ,CTE_RecursionResult
    AS
    (
        SELECT AnchorIdent, Ident1, Ident2
        FROM CTE_Recursive
    )
    ,CTE_CleanResult
    AS
    (
        SELECT AnchorIdent, Ident1 AS Ident
        FROM CTE_RecursionResult
    
        UNION
    
        SELECT AnchorIdent, Ident2 AS Ident
        FROM CTE_RecursionResult
    )
    SELECT
        CTE_Idents.Ident
        ,CASE WHEN CA_Data.XML_Value IS NULL 
        THEN CTE_Idents.Ident ELSE CA_Data.XML_Value END AS GroupMembers
        ,DENSE_RANK() OVER(ORDER BY 
            CASE WHEN CA_Data.XML_Value IS NULL 
            THEN CTE_Idents.Ident ELSE CA_Data.XML_Value END
        ) AS GroupID
    FROM
        CTE_Idents
        CROSS APPLY
        (
            SELECT CTE_CleanResult.Ident+','
            FROM CTE_CleanResult
            WHERE CTE_CleanResult.AnchorIdent = CTE_Idents.Ident
            ORDER BY CTE_CleanResult.Ident FOR XML PATH(''), TYPE
        ) AS CA_XML(XML_Value)
        CROSS APPLY
        (
            SELECT CA_XML.XML_Value.value('.', 'NVARCHAR(MAX)')
        ) AS CA_Data(XML_Value)
    WHERE
        CTE_Idents.Ident IS NOT NULL
    ORDER BY Ident;
    

    Result 1

    +-------+--------------+---------+
    | Ident | GroupMembers | GroupID |
    +-------+--------------+---------+
    | a     | a,b,c,       |       1 |
    | b     | a,b,c,       |       1 |
    | c     | a,b,c,       |       1 |
    +-------+--------------+---------+
    

    Result 2

    +-------+--------------+---------+
    | Ident | GroupMembers | GroupID |
    +-------+--------------+---------+
    | a     | a,c,g,h,l,   |       1 |
    | b     | b,d,f,j,     |       2 |
    | c     | a,c,g,h,l,   |       1 |
    | d     | b,d,f,j,     |       2 |
    | e     | e,k,         |       3 |
    | f     | b,d,f,j,     |       2 |
    | g     | a,c,g,h,l,   |       1 |
    | h     | a,c,g,h,l,   |       1 |
    | i     | i            |       4 |
    | j     | b,d,f,j,     |       2 |
    | k     | e,k,         |       3 |
    | l     | a,c,g,h,l,   |       1 |
    | z     | z            |       5 |
    +-------+--------------+---------+
    

    Result 3

    +-------+--------------------------+---------+
    | Ident |       GroupMembers       | GroupID |
    +-------+--------------------------+---------+
    | a     | a,b,c,d,e,f,g,h,j,k,l,m, |       1 |
    | b     | a,b,c,d,e,f,g,h,j,k,l,m, |       1 |
    | c     | a,b,c,d,e,f,g,h,j,k,l,m, |       1 |
    | d     | a,b,c,d,e,f,g,h,j,k,l,m, |       1 |
    | e     | a,b,c,d,e,f,g,h,j,k,l,m, |       1 |
    | f     | a,b,c,d,e,f,g,h,j,k,l,m, |       1 |
    | g     | a,b,c,d,e,f,g,h,j,k,l,m, |       1 |
    | h     | a,b,c,d,e,f,g,h,j,k,l,m, |       1 |
    | j     | a,b,c,d,e,f,g,h,j,k,l,m, |       1 |
    | k     | a,b,c,d,e,f,g,h,j,k,l,m, |       1 |
    | l     | a,b,c,d,e,f,g,h,j,k,l,m, |       1 |
    | m     | a,b,c,d,e,f,g,h,j,k,l,m, |       1 |
    +-------+--------------------------+---------+
    

    How it works

    I'll use the second set of sample data for this explanation.

    CTE_Idents

    CTE_Idents gives the list of all Identifiers that appear in both Ident1 and Ident2 columns. Since they can appear in any order we UNION both columns together. UNION also removes any duplicates.

    +-------+
    | Ident |
    +-------+
    | NULL  |
    | a     |
    | b     |
    | c     |
    | d     |
    | e     |
    | f     |
    | g     |
    | h     |
    | i     |
    | j     |
    | k     |
    | l     |
    | z     |
    +-------+
    

    CTE_Pairs

    CTE_Pairs gives the list of all edges of the graph in both directions. Again, UNION is used to remove any duplicates.

    +--------+--------+
    | Ident1 | Ident2 |
    +--------+--------+
    | a      | c      |
    | a      | g      |
    | b      | f      |
    | b      | j      |
    | c      | a      |
    | c      | h      |
    | d      | f      |
    | e      | k      |
    | f      | b      |
    | f      | d      |
    | g      | a      |
    | h      | c      |
    | h      | l      |
    | j      | b      |
    | k      | e      |
    | l      | h      |
    +--------+--------+
    

    CTE_Recursive

    CTE_Recursive is the main part of the query that recursively traverses the graph starting from each unique Identifier. These starting rows are produced by the first part of UNION ALL. The second part of UNION ALL recursively joins to itself linking Ident2 to Ident1. Since we pre-made CTE_Pairs with all edges written in both directions, we can always link only Ident2 to Ident1 and we'll get all paths in the graph. At the same time the query builds IdentPath - a string of comma-delimited Identifiers that have been traversed so far. It is used in the WHERE filter:

    CTE_Recursive.IdentPath NOT LIKE CAST('%,' + CTE_Pairs.Ident2 + ',%' AS varchar(8000))
    

    As soon as we come across the Identifier that had been included in the Path before, the recursion stops as the list of connected nodes is exhausted. AnchorIdent is the starting Identifier for the recursion, it will be used later to group results. Lvl is not really used, I included it for better understanding of what is going on.

    +-------------+--------+--------+-------------+-----+
    | AnchorIdent | Ident1 | Ident2 |  IdentPath  | Lvl |
    +-------------+--------+--------+-------------+-----+
    | a           | a      | c      | ,a,c,       |   1 |
    | a           | a      | g      | ,a,g,       |   1 |
    | b           | b      | f      | ,b,f,       |   1 |
    | b           | b      | j      | ,b,j,       |   1 |
    | c           | c      | a      | ,c,a,       |   1 |
    | c           | c      | h      | ,c,h,       |   1 |
    | d           | d      | f      | ,d,f,       |   1 |
    | e           | e      | k      | ,e,k,       |   1 |
    | f           | f      | b      | ,f,b,       |   1 |
    | f           | f      | d      | ,f,d,       |   1 |
    | g           | g      | a      | ,g,a,       |   1 |
    | h           | h      | c      | ,h,c,       |   1 |
    | h           | h      | l      | ,h,l,       |   1 |
    | j           | j      | b      | ,j,b,       |   1 |
    | k           | k      | e      | ,k,e,       |   1 |
    | l           | l      | h      | ,l,h,       |   1 |
    | l           | h      | c      | ,l,h,c,     |   2 |
    | l           | c      | a      | ,l,h,c,a,   |   3 |
    | l           | a      | g      | ,l,h,c,a,g, |   4 |
    | j           | b      | f      | ,j,b,f,     |   2 |
    | j           | f      | d      | ,j,b,f,d,   |   3 |
    | h           | c      | a      | ,h,c,a,     |   2 |
    | h           | a      | g      | ,h,c,a,g,   |   3 |
    | g           | a      | c      | ,g,a,c,     |   2 |
    | g           | c      | h      | ,g,a,c,h,   |   3 |
    | g           | h      | l      | ,g,a,c,h,l, |   4 |
    | f           | b      | j      | ,f,b,j,     |   2 |
    | d           | f      | b      | ,d,f,b,     |   2 |
    | d           | b      | j      | ,d,f,b,j,   |   3 |
    | c           | h      | l      | ,c,h,l,     |   2 |
    | c           | a      | g      | ,c,a,g,     |   2 |
    | b           | f      | d      | ,b,f,d,     |   2 |
    | a           | c      | h      | ,a,c,h,     |   2 |
    | a           | h      | l      | ,a,c,h,l,   |   3 |
    +-------------+--------+--------+-------------+-----+
    

    CTE_CleanResult

    CTE_CleanResult leaves only relevant parts from CTE_Recursive and again merges both Ident1 and Ident2 using UNION.

    +-------------+-------+
    | AnchorIdent | Ident |
    +-------------+-------+
    | a           | a     |
    | a           | c     |
    | a           | g     |
    | a           | h     |
    | a           | l     |
    | b           | b     |
    | b           | d     |
    | b           | f     |
    | b           | j     |
    | c           | a     |
    | c           | c     |
    | c           | g     |
    | c           | h     |
    | c           | l     |
    | d           | b     |
    | d           | d     |
    | d           | f     |
    | d           | j     |
    | e           | e     |
    | e           | k     |
    | f           | b     |
    | f           | d     |
    | f           | f     |
    | f           | j     |
    | g           | a     |
    | g           | c     |
    | g           | g     |
    | g           | h     |
    | g           | l     |
    | h           | a     |
    | h           | c     |
    | h           | g     |
    | h           | h     |
    | h           | l     |
    | j           | b     |
    | j           | d     |
    | j           | f     |
    | j           | j     |
    | k           | e     |
    | k           | k     |
    | l           | a     |
    | l           | c     |
    | l           | g     |
    | l           | h     |
    | l           | l     |
    +-------------+-------+
    

    Final SELECT

    Now we need to build a string of comma-separated Ident values for each AnchorIdent. CROSS APPLY with FOR XML does it. DENSE_RANK() calculates the GroupID numbers for each AnchorIdent.