I have a large dataset similar to the following one.
ID | Date | Change | wage |
---|---|---|---|
201 | 04.01.2023 | B-AL | 900 |
201 | 01.03.2023 | AL-AL | 900 |
201 | 01.10.2023 | AL-AL | 900 |
201 | 10.11.2023 | AL-B | 1900 |
201 | 28.11.2023 | B-AL | 900 |
201 | 10.12.2023 | AL-B | 1905 |
301 | 03.01.2023 | B-AL | 1100 |
301 | 11.02.2023 | AL-B | 2500 |
301 | 15.04.2023 | B-AL | 1110 |
301 | 27.09.2023 | AL-B | 2506 |
401 | 02.01.2023 | B-AL | 700 |
401 | 28.01.2023 | AL-X | 0 |
401 | 10.02.2023 | AL-AL | 700 |
401 | 02.04.2023 | AL-AL | 700 |
401 | 10.05.2023 | AL-B | 2200 |
401 | 13.08.2023 | B-AL | 800 |
I want to calculate the average wage by each status, but with two restirctions: First, I only want to consider statuses of each individual until the status "B-AL" is reached the second time. All rows after the second occurance of "B-AL" by each ID should not be considered in the calculation.
(I think) I solved this by creating two help variables and then filter by those with cumsum()
data <- data %>%
arrange(date) %>%
mutate(num = row_number(),
num2 = ifelse(change == "B-AL" & num == 2, 1, 0),
.by = c(ID, change))
data <- data %>%
group_by(ID) %>%
filter(cumsum(num2) == 0) %>%
ungroup()
Now the second restiction: As soon as one ID has status "AL-X" all subseqent rows should not bne considered as well.
If I use the same approach as above I also loose the observation where "AL-X" occurs the first time. However, I want this one to be included and all subsequent ones to be excluded. After both operations the table should look like this
ID | Date | Change | wage |
---|---|---|---|
201 | 04.01.2023 | B-AL | 900 |
201 | 01.03.2023 | AL-AL | 900 |
201 | 01.10.2023 | AL-AL | 900 |
201 | 10.11.2023 | AL-B | 1900 |
301 | 03.01.2023 | B-AL | 1100 |
301 | 11.02.2023 | AL-B | 2500 |
401 | 02.01.2023 | B-AL | 700 |
401 | 28.01.2023 | AL-X | 0 |
After that I can calculate the average wages with summarise (first by status and then by ID)
Any help is appreciated. I am also not sure whether the first operation is sound and efficient.
Judging by the desired output you want
library(dplyr)
df %>%
filter(cumsum(Change == c("B-AL")) < 2,
lag(cumsum(Change == c("AL-X")), default=0) < 1, .by = ID)
ID Date Change wage
1 201 04.01.2023 B-AL 900
2 201 01.03.2023 AL-AL 900
3 201 01.10.2023 AL-AL 900
4 201 10.11.2023 AL-B 1900
5 301 03.01.2023 B-AL 1100
6 301 11.02.2023 AL-B 2500
7 401 02.01.2023 B-AL 700
8 401 28.01.2023 AL-X 0