rdataframedplyr

Transition tables from longitudinal data in the long format using R


This question is about how to generate frequency transition tables from longitudinal data in the long format using R base functions or commonly used packages such as dplyr. Consider the longitudinal data

id <- c(1,1,2,2,3,3,4,4)
state <- c("C","A", "A", "A", "B", "A", "C", "A")
period <- rep(c("Start", "End"), 4)
df <- data.frame(id, state, period)
df

  id state period
1  1      C  Start
2  1      A    End
3  2      A  Start
4  2      A    End
5  3      B  Start
6  3      A    End
7  4      C  Start
8  4      A    End

and the expected output

    transition freq
1     A to A    1
2     A to B    0
3     A to C    0
4     B to B    0
5     B to A    1
6     B to C    0
7     C to C    0
8     C to A    2
9     C to B    0

I can generate the above output using the function statetable.msm in the msm package. However, I would like to know if this could be generated by base functions in R or other packages such as dplyr. Help is much appreciated!


Solution

  • With tidyverse, using left_join on the crossing states

    library(dplyr)
    library(tidyr)
    
    left_join(crossing(Start = df$state, End = df$state), 
              pivot_wider(df, names_from = period, values_from = state)) %>% 
      reframe(freq = id * 0 + n(), .by = c(Start, End)) %>% 
      distinct()
    

    output

    # A tibble: 9 × 3
      Start End    freq
      <chr> <chr> <dbl>
    1 A     A         1
    2 A     B        NA
    3 A     C        NA
    4 B     A         1
    5 B     B        NA
    6 B     C        NA
    7 C     A         2
    8 C     B        NA
    9 C     C        NA
    

    or, to get the complete desired format

    left_join(crossing(Start = df$state, End = df$state), 
              pivot_wider(df, names_from = period, values_from = state)) %>% 
      mutate(freq = id * 0 + n(),
             freq = replace(freq, is.na(freq), 0), .by = c(Start, End)) %>% 
      mutate(transition = purrr::map2_vec(Start, End, ~ paste(.x, "to", .y)), .before=1) %>%
      select(-c(Start, End, id)) %>% 
      distinct()
    Joining with `by = join_by(Start, End)`
    # A tibble: 9 × 2
      transition  freq
      <chr>      <dbl>
    1 A to A         1
    2 A to B         0
    3 A to C         0
    4 B to A         1
    5 B to B         0
    6 B to C         0
    7 C to A         2
    8 C to B         0
    9 C to C         0