sql postgresql snowflake-cloud-data-platform postgresql-17

How to create group column based on ids and linked ids in PostgreSQL

I have a table containing incident ids. Each incident can be linked to another one because they have been stated as similar. The table contains two columns:

incident_id the id of the incident
linked_incident_id the id of a linked incident

I would like to make groups based on these ids.

I tried to make left joins to link the ids but if an id is linked to n ids I will need to make n left joins. Imagine n=100, this would become absurd.

Here is a simplified version of my table:

incident_id	linked_incident_id
1	5
4	8
2	3
8	2
9	10
5	null
7	9
2	8
9	7
6	10
10	null
12	11

Here is the expected output:

incident_id	linked_incident_id	groups
1	5	group 1
4	8	group 2
2	3	group 2
8	2	group 2
9	10	group 3
5	null	group 1
7	9	group 3
2	8	group 2
9	7	group 3
6	10	group 3
10	null	group 3
12	11	group 4
13	null	group 5

The actual data I am working with is bigger (approx. 25000 lines).

I am actually working with Snowflake and PostgreSQL 17, but any SQL engine would be helpful, I am missing an effective logic here. An SQL query would be ideal !

Solution

So, the problem is that incident1 is linked to incident2 which is linked to incident3 and so on and you need to make them be part of the same group. First of all, this is already possible with your data via recursive queries, here's an untested example

WITH RECURSIVE t AS (
    SELECT incident_id as group_id, incident_id from incidents where linked_incident_id is null
  union all
    select t.group_id, incidents.incident_id from t join incidents on t.incident_id = incidents.linked_incident_id
)
SELECT
    group_id,
    string_agg(incident_id::text, ',')
FROM t
group by group_id;

Tried this with the sample of

create table incidents (incident_id int, linked_incident_id int);

insert into incidents(incident_id, linked_incident_id)
values
(1, null),
(2, null),
(0, null),
(3, 1),
(5, 1),
(7, 3),
(9, 3),
(10, 2);

Fiddle.

So in this recursive query we first define the trivial records, that is, records whose linked incident is null (the root incidents) and define their incident_id as a group_id, which will later on transcend to their descendants via the second select after the union all, which finds descendants, then descendants of descendants and so on.

With this recursive logic we will end up having pairs of (group_id, incident_id) which are the pairs of root incidents and the current incident.

With this tool at our disposal we can simply select from this one, grouping by the group_id and concatenating the incident_id.

However, this is not an ideal solution. Because then you may end up with problems, when two issues link each-other and so forth. And of course, doing this kind of recursive search may not be best for performance either.

So, even though this is a solution for your current setup, I propose the following steps:

make sure that temporarily new records will not be inserted
create a root_incident_id field
update it for currently existent records, inspired by the results of the recursive query I have described earlier so each record will have a root incident pointing to the root problem's incident_id
create a trigger on your table for inserts, so if you create a new record, whose linked incident is not null, then update the root incident id to the root of the linked incident
create a trigger on your table for updates, so if you update an existent record, then update the root incident, to null if the linked incident is null and the root of the linked incident otherwise
create a trigger on your table for deletes, so if you delete an existent record whose linked incident is null, then 1. its immediate children (whose linked incident is the one you remove) will have null root and adjust the new root incident of their descendants

This way you will have a consistent database and you will not need recursion whenever you want to group by root incident