sqlsql-servert-sqlhive

Chaining joins in SQL based on dynamic table


The title may not be accurate for the question but here goes! I have the following table:

id1    id2    status
1      2      a
2      3      b
3      4      c
6      7      d
7      8      e
8      9      f
9      10     g

I would like to get the first id1 and last status based on a dynamic chain joining, meaning that the result table will be:

id    final_status
1     c
6     g

Logically, I want to construct the following arrays based on joining the table to itself:

id1    chained_ids    chained_status
1      [2,3,4]        [a,b,c]
6      [7,8,9,10]     [d,e,f,g]

Then grab the last element of the chained_status list.

Since if we were to keep joining this table to itself on id1 = id2 we would eventually have single rows with these results. The problem is that the number of joins is not constant (a single id may be chained many or few times). There is always a 1 to 1 mapping of id1 to id2.

Thanks in advance! This can be done in either T-SQL or Hive (if someone has a clever map-reduce solution).


Solution

  • You can do this with a recursive CTE:

    ;WITH My_CTE AS
    (
        SELECT
            id1,
            id2,
            status,
            1 AS lvl
        FROM
            My_Table T1
        WHERE
            NOT EXISTS
            (
                SELECT *
                FROM My_Table T2
                WHERE T2.id2 = T1.id1
            )
        UNION ALL
        SELECT
            CTE.id1,
            T3.id2,
            T3.status,
            CTE.lvl + 1
        FROM
            My_CTE CTE
        INNER JOIN My_Table T3 ON T3.id1 = CTE.id2
    )
    SELECT
        CTE.id1,
        CTE.status
    FROM
        My_CTE CTE
    INNER JOIN (SELECT id1, MAX(lvl) AS max_lvl FROM My_CTE GROUP BY id1) M ON
        M.id1 = CTE.id1 AND
        M.max_lvl = CTE.lvl