sqlsql-serversql-server-2012groupingrelational-division

Grouping of data based on patterns or sets of other rows


I have a data set with two columns (Person and Item) and I want to generate a group column (Group) based on column1 but the group will expand to include rows which have the same values as column 2.

Sample1 - because 1 and 2 both have a and b, they are considered 1 group. 3 is separated because it is the only one with item a.

Person Item Group
1 a 1
1 b 1
2 a 1
2 b 1
3 a 2

Sample2 - because 1 and 2 have different items, they are considered separate groups. 3 is also separate even though it has the same "a" item because it is the only item it has.

Person Item Group
1 a 1
1 b 1
2 c 2
2 d 2
3 a 3

Sample3 - both 1 and 2 have the same items so they are all considered as 1 group

Person Item Group
1 a 1
1 b 1
1 c 1
2 a 1
2 b 1
2 c 1

I found a similar post but this doesn't have the same grouping/checking mechanic on the second column: Group identifiers/values that are related with each other between multiple columns

I'm thinking I need to do a self-join based on Item and loop through it, but a bit stuck as of the moment.

Thanks in advance!


Solution

  • Identify sets of items per person, then group by that sets.

    with t (person, item) as (
    select 1, 'a' union all
    select 1, 'b' union all
    select 2, 'a' union all
    select 2, 'b' union all
    select 3, 'a'
    )
    , person_to_itemset as (
    select person, string_agg(item,',') within group (order by item) as itemset
    from t
    group by person
    )
    , itemset_to_group as (
    select itemset, row_number() over (order by min(person)) grp
    from person_to_itemset
    group by itemset
    )
    select t.*, ig.grp
    from t
    join person_to_itemset pi on t.person = pi.person
    join itemset_to_group ig on pi.itemset = ig.itemset
    
    
    person item grp
    1 a 1
    1 b 1
    2 a 1
    2 b 1
    3 a 2

    fiddle

    (Note that string_agg function was used just to demonstration purposes. To work safely the item column must not contain the comma character and probably it won't perform well on large number of items per person. Grouping directly into collection type could be alternative, unfortunately I actually don't use SQL Server and am not certain if it supports some collection, including ability to equality comparison for grouping, it is also important whether items per person form set or multiset.)