My data frame:
set.seed(1)
minimalsample <- data.frame(ID=rep(1:5, each= 20), round_number= rep(1:20, 5), event_occurred=rbinom(100, size=1, prob=0.2))
ID | round_number B | event_occurred |
---|---|---|
1 | 1 | 0 |
1 | 2 | 0 |
1 | 3 | 0 |
1 | 4 | 1 |
1 | 5 | 0 |
1 | 6 | 1 |
1 | 7 | 1 |
1 | 8 | 0 |
1 | 9 | 0 |
1 | 10 | 0 |
1 | 11 | 0 |
I want a new column lagEventOccurred which tells me how many rounds ago the event has last occurred, i.e.:
lagEventOccurred <- c(Inf, Inf, Inf, Inf, Inf, 1, 2, 1, 1, 2, 3, 4, 5, 6, 7)
for the first few entries and so on. Note that if the event occurred in the same round, it is not considered. I only want to look at previous rounds. All of this should be grouped by ID.
ID | round_number B | event_occurred | lagEventOccurred |
---|---|---|---|
1 | 1 | 0 | Inf |
1 | 2 | 0 | Inf |
1 | 3 | 0 | Inf |
1 | 4 | 1 | Inf |
1 | 5 | 0 | 1 |
1 | 6 | 1 | 2 |
1 | 7 | 1 | 1 |
1 | 8 | 0 | 1 |
1 | 9 | 0 | 2 |
1 | 10 | 0 | 3 |
1 | 11 | 0 | 4 |
I tried multiple things, including something like
minimalsample %>% dplyr::mutate(firststep = ifelse(event_occurred==1, round_number, Inf) ) %>%
dplyr::mutate(secondstep = ifelse((lag(firststep) < round_number), lag(firststep), firststep )) %>%
dplyr::mutate(thirdstep = ifelse((lag(firststep) < round_number), round_number- lag(firststep), round_number-firststep ))
but it seems that I would have to iterate over the lag()
procedure 20 times to get the desired results. How to return how many lag()
s ago a condition was satisfied?
I have assumed that if the beginning of each "ID" group is a sequence of zeros, it should be treated as Inf
. Using the dplyr
and tidyr
packages, the workflow is:
cumsum()
fill()
to define start and end of each event 'group'seq_along()
to return lagged length since last event occurredlibrary(dplyr)
library(tidyr)
set.seed(1)
minimalsample <- data.frame(ID=rep(1:5, each= 20),
round_number= rep(1:20, 5),
event_occurred=rbinom(100, size=1, prob=0.2))
minimalsample <- minimalsample %>%
mutate(lagEventOccurred = ifelse(event_occurred == 0, NA, 1)) %>%
group_by(ID, lagEventOccurred) %>%
mutate(lagEventOccurred = cumsum(lagEventOccurred)) %>%
ungroup() %>%
group_by(ID) %>%
fill(lagEventOccurred) %>%
mutate(lagEventOccurred = lag(lagEventOccurred)) %>%
group_by(ID, lagEventOccurred) %>%
mutate(lagEventOccurred = ifelse(is.na(lagEventOccurred),
Inf, seq_along(lagEventOccurred))) %>%
ungroup()
head(minimalsample, 10)
# A tibble: 10 × 4
ID round_number event_occurred lagEventOccurred
<int> <int> <int> <dbl>
1 1 1 0 Inf
2 1 2 0 Inf
3 1 3 0 Inf
4 1 4 1 Inf
5 1 5 0 1
6 1 6 1 2
7 1 7 1 1
8 1 8 0 1
9 1 9 0 2
10 1 10 0 3
tail(minimalsample, 10)
# A tibble: 10 × 4
ID round_number event_occurred lagEventOccurred
<int> <int> <int> <dbl>
1 5 11 0 Inf
2 5 12 0 Inf
3 5 13 0 Inf
4 5 14 1 Inf
5 5 15 0 1
6 5 16 0 2
7 5 17 0 3
8 5 18 0 4
9 5 19 1 5
10 5 20 0 1