rdplyr

Return when last event occurred in dataframe by group


My data frame:

set.seed(1)
minimalsample <- data.frame(ID=rep(1:5, each= 20), round_number= rep(1:20, 5), event_occurred=rbinom(100, size=1, prob=0.2))
ID round_number B event_occurred
1 1 0
1 2 0
1 3 0
1 4 1
1 5 0
1 6 1
1 7 1
1 8 0
1 9 0
1 10 0
1 11 0

I want a new column lagEventOccurred which tells me how many rounds ago the event has last occurred, i.e.:

lagEventOccurred <- c(Inf, Inf, Inf, Inf, Inf, 1,  2, 1, 1, 2, 3, 4, 5, 6, 7)

for the first few entries and so on. Note that if the event occurred in the same round, it is not considered. I only want to look at previous rounds. All of this should be grouped by ID.

ID round_number B event_occurred lagEventOccurred
1 1 0 Inf
1 2 0 Inf
1 3 0 Inf
1 4 1 Inf
1 5 0 1
1 6 1 2
1 7 1 1
1 8 0 1
1 9 0 2
1 10 0 3
1 11 0 4

I tried multiple things, including something like

minimalsample %>% dplyr::mutate(firststep = ifelse(event_occurred==1, round_number, Inf) )  %>%
  dplyr::mutate(secondstep = ifelse((lag(firststep) < round_number), lag(firststep), firststep )) %>%
  dplyr::mutate(thirdstep = ifelse((lag(firststep) < round_number), round_number- lag(firststep), round_number-firststep ))

but it seems that I would have to iterate over the lag() procedure 20 times to get the desired results. How to return how many lag()s ago a condition was satisfied?


Solution

  • I have assumed that if the beginning of each "ID" group is a sequence of zeros, it should be treated as Inf. Using the dplyr and tidyr packages, the workflow is:

    1. Create copy of "event_occurred" called "lagEventOccurred"
    2. Within each "ID" group, create a new 'id' value for every event using cumsum()
    3. Use fill() to define start and end of each event 'group'
    4. Lag the "lagEventOccurred" column and use seq_along() to return lagged length since last event occurred
    library(dplyr)
    library(tidyr)
    
    set.seed(1)
    minimalsample <- data.frame(ID=rep(1:5, each= 20), 
                                round_number= rep(1:20, 5), 
                                event_occurred=rbinom(100, size=1, prob=0.2))
    
    
    minimalsample <- minimalsample %>%
      mutate(lagEventOccurred = ifelse(event_occurred == 0, NA, 1)) %>%
      group_by(ID, lagEventOccurred) %>%
      mutate(lagEventOccurred = cumsum(lagEventOccurred)) %>%
      ungroup() %>%
      group_by(ID) %>%
      fill(lagEventOccurred) %>%
      mutate(lagEventOccurred = lag(lagEventOccurred)) %>%
      group_by(ID, lagEventOccurred) %>%
      mutate(lagEventOccurred = ifelse(is.na(lagEventOccurred), 
                                       Inf, seq_along(lagEventOccurred))) %>%
      ungroup()
    
    
    head(minimalsample, 10)
    # A tibble: 10 × 4
         ID round_number event_occurred lagEventOccurred
      <int>        <int>          <int>            <dbl>
    1     1            1              0              Inf
    2     1            2              0              Inf
    3     1            3              0              Inf
    4     1            4              1              Inf
    5     1            5              0                1
    6     1            6              1                2
    7     1            7              1                1
    8     1            8              0                1
    9     1            9              0                2
    10    1           10              0                3
    
    tail(minimalsample, 10)
    # A tibble: 10 × 4
         ID round_number event_occurred lagEventOccurred
      <int>        <int>          <int>            <dbl>
    1     5           11              0              Inf
    2     5           12              0              Inf
    3     5           13              0              Inf
    4     5           14              1              Inf
    5     5           15              0                1
    6     5           16              0                2
    7     5           17              0                3
    8     5           18              0                4
    9     5           19              1                5
    10    5           20              0                1