r, dplyr, time-series

Creating a Count (With Reset) Variable


I have a dataset with a binary indicator of event occurrences. From this, I would like to create a count of the number of consecutive timesteps without an event occurring. As an example (TS = timestep, EV = event indicator, C = count):

TS: 1 -> 2 -> 3 -> 4 -> 5 -> ...
EV: 0 -> 0 -> 1 -> 0 -> 0 -> ...
C:  0 -> 1 -> 0 -> 0 -> 1 -> ...

As an example dataframe, please consider:

labs <- c("A","A","A","A", "B","B","B","B", "C","C","C","C", "D","D","D","D","D")
time <- c(1,2,3,4, 1,2,3,4, 1,2,3,4, 1,2,3,4,5)
event <- c(0,0,0,0, 0,1,0,0, 1,1,0,0, NA,0,0,1,0)
desiredOutcome <- c(0,1,2,3, 0,0,0,1, 0,0,0,1, NA,0,1,0,0) # goal

exDF <- data.frame(labs, time, event, desiredOutcome)

From that end goal and dataframe, I ended up with the following code:

library(dplyr)

exDF <- exDF %>%
  group_by(labs) %>%
  mutate(pe1 = lag(event, order_by=time)) # create new variable for prior event


exDF$count2 <- ifelse(
  (exDF$pe1 == 1) & (exDF$event == 0), # previous timestep had an event and no event is ongoing now
  0,  # TRUE value
  NA) # FALSE value


exDF$count <- ifelse(
  is.na(exDF$pe1) & (exDF$event == 0), # no previous timestep in the group and no current event
  0, # TRUE value
  exDF$count2) # FALSE value

It seems to fill in all of the zeros correctly, but I am not aware of a good way to get from those zeros and NAs to my desired outcome.

Most of my experimentation involved combining mutate and lag, but that only fills in the next set of values (if the zeros are in the input column, only the ones show up; if the ones, then the twos). The following example doesn't attempt to handle the reset of the count, but produces the behavior described above:

exDF <- exDF %>%
  group_by(labs) %>%
  mutate(countFinal = lag(count, order_by=time) + 1) 

So, my challenge relates to the order in which things are resolved. With something like the mutate command here, the order seems to be:

Pull all cell values by label -> Look at their lags -> Add 1 -> Done, but incorrectly

When I need it to be something like:

Pull first cell value by label -> Look at lag -> Add 1 or reset -> Pull second cell (filled in prior step) value by label -> Look at their lags -> Add 1 or reset -> Pull third... -> Done
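That step-by-step recurrence (take the previous count, then add 1 or reset) is a fold over each group, which purrr::accumulate() expresses directly. A sketch of the idea, not from the original post, assuming an NA event resets the count and itself gets an NA count, as in desiredOutcome:

```r
library(dplyr)
library(purrr)

exDF <- data.frame(
  labs  = c("A","A","A","A", "B","B","B","B", "C","C","C","C", "D","D","D","D","D"),
  time  = c(1,2,3,4, 1,2,3,4, 1,2,3,4, 1,2,3,4,5),
  event = c(0,0,0,0, 0,1,0,0, 1,1,0,0, NA,0,0,1,0)
)

counted <- exDF |>
  group_by(labs) |>
  arrange(time, .by_group = TRUE) |>
  mutate(count = {
    # Fold over the events: -1 marks "an event (or NA) happened here,
    # so the next event-free step restarts the count at 0".
    s <- accumulate(event,
                    \(prev, e) if (is.na(e) || e == 1) -1 else prev + 1,
                    .init = -1)[-1]
    ifelse(is.na(event), NA, pmax(s, 0)) # event rows show 0, NA events stay NA
  }) |>
  ungroup()

counted$count
# 0 1 2 3 0 0 0 1 0 0 0 1 NA 0 1 0 0  (matches desiredOutcome)
```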

Is there a good way to do this with an existing package?


Solution

  • Could not think of a more direct approach, but this works. The workflow:

    1. Create a copy of event (tmp) and replace NA with a value not otherwise used, e.g. 2
    2. Give each run of identical tmp values a unique id
    3. replace() the first value in each run with zero and change the remaining non-zero id values to 1
    4. Take the cumulative sum of tmp within each run
    5. Write the count into desiredOutcome, restoring NA where event is NA

    library(dplyr)
    
    exDF |>
      group_by(labs) |>
      mutate(tmp = if_else(is.na(event), 2, event), # NA becomes its own "event" value
             tmp = cumsum(tmp != lag(tmp, default = 1))) |> # unique id per run
      group_by(labs, tmp) |>
      mutate(tmp = replace(tmp, 1, 0),
             tmp = if_else(tmp != 0, 1, 0),
             tmp = cumsum(tmp), # 0, 1, 2, ... within each run
             desiredOutcome = if_else(is.na(event), NA_real_, tmp)) |>
      ungroup() |>
      select(-tmp)
             
    # # A tibble: 17 × 4
    #    labs   time event desiredOutcome
    #    <chr> <dbl> <dbl>          <dbl>
    #  1 A         1     0              0
    #  2 A         2     0              1
    #  3 A         3     0              2
    #  4 A         4     0              3
    #  5 B         1     0              0
    #  6 B         2     1              0
    #  7 B         3     0              0
    #  8 B         4     0              1
    #  9 C         1     1              0
    # 10 C         2     1              0
    # 11 C         3     0              0
    # 12 C         4     0              1
    # 13 D         1    NA             NA
    # 14 D         2     0              0
    # 15 D         3     0              1
    # 16 D         4     1              0
    # 17 D         5     0              0
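A shorter route, not from the original answer: dplyr (>= 1.1) provides consecutive_id() (the same idea as data.table::rleid()), which collapses steps 1–4 into the grouping itself. A sketch:

```r
library(dplyr) # >= 1.1 for consecutive_id() and case_when(.default = )

exDF <- data.frame(
  labs  = c("A","A","A","A", "B","B","B","B", "C","C","C","C", "D","D","D","D","D"),
  time  = c(1,2,3,4, 1,2,3,4, 1,2,3,4, 1,2,3,4,5),
  event = c(0,0,0,0, 0,1,0,0, 1,1,0,0, NA,0,0,1,0)
)

exDF |>
  # one group per run of identical event values (a run of NAs is its own group)
  group_by(labs, run = consecutive_id(event)) |>
  mutate(count = case_when(
    is.na(event) ~ NA_real_,        # unknown event -> unknown count
    event == 1   ~ 0,               # the count stays 0 while an event is ongoing
    .default     = row_number() - 1 # position within the event-free run
  )) |>
  ungroup() |>
  select(-run)
```

Because the grouping includes labs, a run of zeros that happens to continue across a label boundary still restarts its count with each label.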