[SOLVED] Transition tables from longitudinal data in the long format using R

Transition tables from longitudinal data in the long format using R

This question is about how to generate frequency transition tables from longitudinal data in the long format using R base functions or commonly used packages such as dplyr. Consider the longitudinal data

id <- c(1,1,2,2,3,3,4,4)
state <- c("C","A", "A", "A", "B", "A", "C", "A")
period <- rep(c("Start", "End"), 4)
df <- data.frame(id, state, period)
df

  id state period
1  1      C  Start
2  1      A    End
3  2      A  Start
4  2      A    End
5  3      B  Start
6  3      A    End
7  4      C  Start
8  4      A    End

and the expected output

    transition freq
1     A to A    1
2     A to B    0
3     A to C    0
4     B to B    0
5     B to A    1
6     B to C    0
7     C to C    0
8     C to A    2
9     C to B    0

I can generate the above output using the function statetable.msm in the msm package. However, I would like to know if this could be generated by base functions in R or other packages such as dplyr. Help is much appreciated!

Solution

With tidyverse, using left_join on the crossing states

library(dplyr)
library(tidyr)

left_join(crossing(Start = df$state, End = df$state), 
          pivot_wider(df, names_from = period, values_from = state)) %>% 
  reframe(freq = id * 0 + n(), .by = c(Start, End)) %>% 
  distinct()

output

# A tibble: 9 × 3
  Start End    freq
  <chr> <chr> <dbl>
1 A     A         1
2 A     B        NA
3 A     C        NA
4 B     A         1
5 B     B        NA
6 B     C        NA
7 C     A         2
8 C     B        NA
9 C     C        NA

or, to get the complete desired format

left_join(crossing(Start = df$state, End = df$state), 
          pivot_wider(df, names_from = period, values_from = state)) %>% 
  mutate(freq = id * 0 + n(),
         freq = replace(freq, is.na(freq), 0), .by = c(Start, End)) %>% 
  mutate(transition = purrr::map2_vec(Start, End, ~ paste(.x, "to", .y)), .before=1) %>%
  select(-c(Start, End, id)) %>% 
  distinct()
Joining with `by = join_by(Start, End)`
# A tibble: 9 × 2
  transition  freq
  <chr>      <dbl>
1 A to A         1
2 A to B         0
3 A to C         0
4 B to A         1
5 B to B         0
6 B to C         0
7 C to A         2
8 C to B         0
9 C to C         0