rgroupingids

R, Assign a group ID to combinations of values in another observation ID variable


Assume that I have several years of personal data, in which a Personal ID (indiv) identifies individuals across years, but the household ID (househ) only assigns household relation within a year.

Example:

libary(dplyr)
test_01 <- data.frame(indiv=c(1,2,3,4,5,6),househ=c(1,1,2,3,4,4),time=rep(1,6)) #1&2 form a household, 3 and 4 are single, 5&6 form a household
test_02 <- data.frame(indiv=c(2,3,4,5,6,7),househ=c(1,2,2,3,3,4),time=rep(2,6)) #1 exits, so 2 is now a new household, 3&4 now form a new household, 5&6 still do, 7 enters
test_03 <- data.frame(indiv=c(2,3,4,5,7,8,9,10),househ=c(1,2,2,3,4,5,5,6),time=rep(3,8)) #according to logic above
data_test_panel <- bind_rows(test_01,test_02,test_03)

The desired time-consistent household variable would be:

data_test_panel$true_household <- c(1,1,2,3,4,4,5,6,6,4,4,7,5,6,6,8,7,9,9,10)

So far I tried:

library(data.table)
setDT(data_test_panel)[,cons_household := .GRP,.(time,househ)] # where cons_household is the new household ID. However, this doesn't give the same household ID across time but assigns new values for every appearance of a household.

Thank you kindly for all your help! /Severin


Solution

  • Create a list column of unique individuals in a given household at each time period. We can then take the group number by that column.

    dplyr approach

    This is much nicer than the data.table approach:

    library(dplyr)
    data_test_panel |>
        mutate(
            indiv_in_household = list(indiv),
            .by = c(househ, time)
        ) |>
        mutate(
            cons_household = cur_group_id(),
            .by = indiv_in_household
        )
    

    Note this uses per-operation grouping with the .by parameter which requires at least dplyr v1.1.0.

    data.table approach

    The syntax for this is much more complicated because data.table will automatically unroll your list column into a vector if you try to make it list(indiv) by group.

    data_test_panel[, indiv_in_household := .SD[
        , .(split(
            rep(indiv, length(indiv)),
            rep(seq(length(indiv)), each = length(indiv))
        )),
        .(househ, time)
    ]$V1]
    
    head(data_test_panel)
    #    indiv househ  time true_household indiv_in_household
    #    <num>  <num> <num>          <num>             <list>
    # 1:     1      1     1              1                1,2
    # 2:     2      1     1              1                1,2
    # 3:     3      2     1              2                  3
    # 4:     4      3     1              3                  4
    # 5:     5      4     1              4                5,6
    # 6:     6      4     1              4                5,6
    

    Also data.table doesn't support grouping by list columns so we need to use toString(), which is OK in this case but in general is not ideal as you have to think through the format of the column to ensure it will not somehow cause ambiguity leading two distinct groups to be merged.

    data_test_panel[, cons_household := .GRP, sapply(indiv_in_household, toString)]
    
    data_test_panel[cons_household != true_household]
    # Empty data.table (0 rows and 6 cols): indiv,househ,time,true_household,cons_household,indiv_in_household