rdataframeuniqueidentifier

Assign cumulative identifier to unique sets for each ID


I have a dataframe which contains two columns, id column id and an observation column val with observations from a-g in random order.

data.frame(
  id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L),
  val = c("a", "b", "c", "d", "e", "f", "a", "b", "b", "a", "c", "e", "d", "a", "b","c","f","g")
)

   id val
1   1   a
2   1   b
3   1   c
4   1   d
5   1   e
6   2   f
7   2   a
8   2   b
9   3   b
10  3   a
11  3   c
12  3   e
13  3   d
14  4   a
15  4   b
16  4   c
17  4   f
18  4   g

I want to assign an (cumulative) identifier valid for each unique set of observations (belonging to each id) given in column val, like so:

   id val valid
1   1   a     1
2   1   b     1
3   1   c     1
4   1   d     1
5   1   e     1
6   2   f     2
7   2   a     2
8   2   b     2
9   3   b     1
10  3   a     1
11  3   c     1
12  3   e     1
13  3   d     1
14  4   a     3
15  4   b     3
16  4   c     3
17  4   f     3
18  4   g     3

Solution

  • In tidyverse.

    1. Create a list-column of sorted (for comparability) values by id
    2. Group by this new column, and use cur_group_id to create the valid column
    3. Unnest to get the desired output
    library(dplyr)
    library(tidyr)
    
    df |> 
      reframe(val = list(sort(val)), .by = id) |> 
      mutate(valid = cur_group_id(), .by = val) |> 
      unnest(val)
    
    # # A tibble: 18 × 3
    #       id val   valid
    #    <int> <chr> <int>
    #  1     1 a         1
    #  2     1 b         1
    #  3     1 c         1
    #  4     1 d         1
    #  5     1 e         1
    #  6     2 a         2
    #  7     2 b         2
    #  8     2 f         2
    #  9     3 a         1
    # 10     3 b         1
    # 11     3 c         1
    # 12     3 d         1
    # 13     3 e         1
    # 14     4 a         3
    # 15     4 b         3
    # 16     4 c         3
    # 17     4 f         3
    # 18     4 g         3