rdplyrlongitudinal

Collapsing longitudinal data into time periods by identifier with dplyr


I have the following example data where there are daily observations for columns X, Y, Z, and C:

structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
3, 3), day = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 
11, 12), X = c(1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 
0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 
    Y = c(0, 0, 1, 2, 1, 1, 0, 1, 3, 1, 0, 0, 1, 1, 0, 0, 1, 
    1, 2, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 3), 
    Z = c("NA", "NA", "NA", "NA", "1", "NA", "NA", "NA", "NA", 
    "NA", "1", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", 
    "1", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", 
    "NA", "NA", "NA", "NA", "NA", "NA", "NA"), C = c(22, 22, 
    23, 22, 24, 23, 22, 25, 25, 24, 25, 25, 24, 26, 18, 17, 17, 
    18, 18, 19, 19, 19, 20, 21, 34, 34, 34, 34, 34, 35, 35, 35, 
    35, 35, 35, 35)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -36L))

I am trying to collapse by ID into data as below for periods of 3 days (except the first baseline day):

structure(list(ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 
3, 3), day = c("1", "2 to 4", "5 to 7", "8 to 10", "11 to 13", 
"14", "1", "2 to 4", "5 to 7", "8 to 10", "1", "2 to 4", "5 to 7", 
"8 to 10", "11 to 12"), Xmax = c(1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 
0, 1, 1, 1, 1), Ymax = c(0, 2, 1, 3, 0, 1, 0, 1, 3, 0, 0, 0, 
0, 1, 3), Zany = c(0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 
0), Cmean = c(22, 22.3333333333333, 23, 24.6666666666667, 24.6666666666667, 
26, 18, 17.3333333333333, 18.6666666666667, 20, 34, 34, 34.6666666666667, 
35, 35)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-15L))

where the first observation is always the first day baseline, then groups of 3 days where the maximum value during that period is a new observation (Xmax/Ymax), or any observation in that period (Zany), and the mean of continuous observations during that period (Cmean).

The number of days for each ID are not the same for all, but the last period does not have to be 3 days (just the remainder).

I have tried using summarise and across in dplyr, but can't seem to figure out a) how to skip first row of each ID and b) the syntax of multiple column specifications for max, any, and mean without doing all separately.

Is it possible to do with one data frame without making multiple for each different type of summary?


Solution

  • We may use cut to create the groups

    library(dplyr)
    df1 %>%
       type.convert(as.is = TRUE) %>% 
       mutate(grp = cut(day, breaks = c(-Inf, seq(1, max(day),
           by = 3), Inf))) %>% 
       group_by(ID, grp) %>%
       summarise(across(X:Y, ~ max(.x, na.rm = TRUE),
      .names = "{.col}_max"), Zany = +any(!is.na(Z)),
         Cmean = mean(C, na.rm = TRUE), .groups = 'drop')
    

    -output

    # A tibble: 15 × 6
          ID grp       X_max Y_max  Zany Cmean
       <int> <fct>     <int> <int> <int> <dbl>
     1     1 (-Inf,1]      1     0     0  22  
     2     1 (1,4]         1     2     0  22.3
     3     1 (4,7]         0     1     1  23  
     4     1 (7,10]        1     3     0  24.7
     5     1 (10,13]       1     1     1  24.7
     6     1 (13, Inf]     1     1     0  26  
     7     2 (-Inf,1]      0     0     0  18  
     8     2 (1,4]         0     1     0  17.3
     9     2 (4,7]         1     3     1  18.7
    10     2 (7,10]        1     0     0  20  
    11     3 (-Inf,1]      0     0     0  34  
    12     3 (1,4]         1     0     0  34  
    13     3 (4,7]         1     0     0  34.7
    14     3 (7,10]        1     1     0  35  
    15     3 (10,13]       1     3     0  35