rdplyrtime-seriessequencing

Code sequence by group in R: recurring values within group


I would like to code a numerical sequence based on the combination of three columns: ID, year, and location. I want to number the years a person spent in one place as a sequence. The sequence should start anew in the year of a location change, so even if a person returns to a place that they have been before, the sequence should start anew.

The df with sequence should look like:

ID yr loc seq
1 1990 A 1
1 1991 A 2
1 1992 B 1
1 1993 B 2
1 1994 B 3
2 1990 B 1
2 1991 B 2
2 1992 A 1
2 1993 B 1
2 1994 B 2
3 1990 C 1
3 1991 C 2
3 1992 C 3
3 1993 B 1
3 1994 C 1

Do you have an idea on how to code this in R?


Solution

  • In data.table, we have rleid function which makes it simple here.

    library(data.table)
    
    setDT(df)[, seq1 := seq_len(.N), .(ID, rleid(loc))]
    
    df
    #    ID   yr loc seq seq1
    # 1:  1 1990   A   1    1
    # 2:  1 1991   A   2    2
    # 3:  1 1992   B   1    1
    # 4:  1 1993   B   2    2
    # 5:  1 1994   B   3    3
    # 6:  2 1990   B   1    1
    # 7:  2 1991   B   2    2
    # 8:  2 1992   A   1    1
    # 9:  2 1993   B   1    1
    #10:  2 1994   B   2    2
    #11:  3 1990   C   1    1
    #12:  3 1991   C   2    2
    #13:  3 1992   C   3    3
    #14:  3 1993   B   1    1
    #15:  3 1994   C   1    1
    

    We can use rleid in dplyr and base R approaches to get expected output.

    library(dplyr)
    df %>%
     group_by(ID, grp = data.table::rleid(loc)) %>%
      mutate(seq1 = row_number())
    

    Or in base R :

    df$seq1 <- with(df, ave(yr, ID, data.table::rleid(loc), FUN = seq_along))
    

    A concise option suggested by @chinsoon12 is to use rowid function.

    setDT(df)[, seq2 := rowid(ID, rleid(loc))]
    

    data

    df <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
    3L, 3L, 3L, 3L, 3L), yr = c(1990L, 1991L, 1992L, 1993L, 1994L, 
    1990L, 1991L, 1992L, 1993L, 1994L, 1990L, 1991L, 1992L, 1993L, 
    1994L), loc = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 
    2L, 3L, 3L, 3L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"), 
    seq = c(1L, 2L, 1L, 2L, 3L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 3L, 
    1L, 1L)), class = "data.frame", row.names = c(NA, -15L))