I have a data set with two columns (Person and Item) and I want to generate a group column (Group) based on column1 but the group will expand to include rows which have the same values as column 2.
Sample1 - because 1 and 2 both have a and b, they are considered 1 group. 3 is separated because it is the only one with item a.
Person | Item | Group |
---|---|---|
1 | a | 1 |
1 | b | 1 |
2 | a | 1 |
2 | b | 1 |
3 | a | 2 |
Sample2 - because 1 and 2 have different items, they are considered separate groups. 3 is also separate even though it has the same "a" item because it is the only item it has.
Person | Item | Group |
---|---|---|
1 | a | 1 |
1 | b | 1 |
2 | c | 2 |
2 | d | 2 |
3 | a | 3 |
Sample3 - both 1 and 2 have the same items so they are all considered as 1 group
Person | Item | Group |
---|---|---|
1 | a | 1 |
1 | b | 1 |
1 | c | 1 |
2 | a | 1 |
2 | b | 1 |
2 | c | 1 |
I found a similar post but this doesn't have the same grouping/checking mechanic on the second column: Group identifiers/values that are related with each other between multiple columns
I'm thinking I need to do a self-join based on Item and loop through it, but a bit stuck as of the moment.
Thanks in advance!
Identify sets of items per person, then group by that sets.
with t (person, item) as (
select 1, 'a' union all
select 1, 'b' union all
select 2, 'a' union all
select 2, 'b' union all
select 3, 'a'
)
, person_to_itemset as (
select person, string_agg(item,',') within group (order by item) as itemset
from t
group by person
)
, itemset_to_group as (
select itemset, row_number() over (order by min(person)) grp
from person_to_itemset
group by itemset
)
select t.*, ig.grp
from t
join person_to_itemset pi on t.person = pi.person
join itemset_to_group ig on pi.itemset = ig.itemset
person | item | grp |
---|---|---|
1 | a | 1 |
1 | b | 1 |
2 | a | 1 |
2 | b | 1 |
3 | a | 2 |
(Note that string_agg
function was used just to demonstration purposes. To work safely the item
column must not contain the comma character and probably it won't perform well on large number of item
s per person. Grouping directly into collection type could be alternative, unfortunately I actually don't use SQL Server and am not certain if it supports some collection, including ability to equality comparison for grouping, it is also important whether item
s per person
form set or multiset.)