I am quite new to dplyr and can't figure what I am doing wrong. I have the following dataset:
INSTRUMENT_USED Year UniqueCount
1 QUEST_A 2015 1
2 QUEST_A 2016 1
3 QUEST_A 2017 1
4 QUEST_A 2018 1
5 QUEST_A 2019 1
6 QUEST_A 2020 1
7 QUEST_A 2021 0
8 QUEST_A 2022 0
9 QUEST_A 2023 0
10 QUEST_B 2015 1
11 QUEST_B 2016 1
12 QUEST_B 2017 1
13 QUEST_B 2018 1
14 QUEST_B 2019 0
15 QUEST_B 2020 0
16 QUEST_B 2021 1
17 QUEST_B 2022 0
18 QUEST_B 2023 0
And I would like to create a variable 'dbreak' that indicates a break when the two following conditions are met:
So in this example above the "dbreak" variable should be "No" everywhere except for QUEST_A 2021 where it would be "Yes"
I somehow managed to make it work manually but as soon as I try to make summarise
conditional on the variable Year
it does not seem to work anymore.
When I run the code below I get the expected result
df <- df %>%
group_by(INSTRUMENT_USED) %>%
arrange(INSTRUMENT_USED,Year) %>%
mutate(
prev = lag(UniqueCount),
dbreak = ifelse(UniqueCount==0 & prev == 1 &
all(UniqueCount[Year <= 2020] == 1) &
all(UniqueCount[Year >= 2021] ==0), "YES", "No"))
INSTRUMENT_USED Year UniqueCount prev dbreak
<fct> <dbl> <dbl> <dbl> <chr>
1 QUEST_A 2015 1 NA No
2 QUEST_A 2016 1 1 No
3 QUEST_A 2017 1 1 No
4 QUEST_A 2018 1 1 No
5 QUEST_A 2019 1 1 No
6 QUEST_A 2020 1 1 No
7 QUEST_A 2021 0 1 Yes
8 QUEST_A 2022 0 0 No
9 QUEST_A 2023 0 0 No
10 QUEST_B 2015 1 NA No
11 QUEST_B 2016 1 1 No
12 QUEST_B 2017 1 1 No
13 QUEST_B 2018 1 1 No
14 QUEST_B 2019 0 1 No
15 QUEST_B 2020 0 0 No
16 QUEST_B 2021 1 0 No
17 QUEST_B 2022 0 1 No
18 QUEST_B 2023 0 0 No
But when I replace the hardcoded years by the Year
variable it does not work anymore, or rather I can not find what is wrong with my condition. If I keep it exactly the same is evaluates all to "No", and if I modify to remove the '=' sign in the second all()
statement it identifies two breaks for QUEST_B (2019 and 2022).
df<- df %>%
group_by(INSTRUMENT_USED) %>%
arrange(INSTRUMENT_USED,Year) %>%
mutate(
prev = lag(UniqueCount),
dbreak = ifelse(UniqueCount==0 & prev == 1 &
all(UniqueCount[Year <= Year-1] == 1) &
all(UniqueCount[Year > Year] ==0), "Yes", "No"))
INSTRUMENT_USED Year UniqueCount prev dbreak
<fct> <dbl> <dbl> <dbl> <chr>
1 QUEST_A 2015 1 NA No
2 QUEST_A 2016 1 1 No
3 QUEST_A 2017 1 1 No
4 QUEST_A 2018 1 1 No
5 QUEST_A 2019 1 1 No
6 QUEST_A 2020 1 1 No
7 QUEST_A 2021 0 1 Yes
8 QUEST_A 2022 0 0 No
9 QUEST_A 2023 0 0 No
10 QUEST_B 2015 1 NA No
11 QUEST_B 2016 1 1 No
12 QUEST_B 2017 1 1 No
13 QUEST_B 2018 1 1 No
14 QUEST_B 2019 0 1 Yes
15 QUEST_B 2020 0 0 No
16 QUEST_B 2021 1 0 No
17 QUEST_B 2022 0 1 Yes
18 QUEST_B 2023 0 0 No
Any idea?
I would calculate the leading difference as a helper column, and then test if (a) there is exactly one difference of 1, (b) all other differences are 0, and (c) the current row's difference is 1, if so 'yes' else 'no':
df %>%
group_by(INSTRUMENT_USED) %>%
arrange(INSTRUMENT_USED,Year) %>%
mutate(
diff = UniqueCount - lead(UniqueCount, default = 0),
dbreak = ifelse(
sum(diff == 1) == 1 & sum(diff == 0) == (n() - 1) & diff == 1,
"Yes", "No"
)
)
# # A tibble: 18 × 5
# # Groups: INSTRUMENT_USED [2]
# INSTRUMENT_USED Year UniqueCount diff dbreak
# <chr> <int> <int> <int> <chr>
# 1 QUEST_A 2015 1 0 No
# 2 QUEST_A 2016 1 0 No
# 3 QUEST_A 2017 1 0 No
# 4 QUEST_A 2018 1 0 No
# 5 QUEST_A 2019 1 0 No
# 6 QUEST_A 2020 1 1 Yes
# 7 QUEST_A 2021 0 0 No
# 8 QUEST_A 2022 0 0 No
# 9 QUEST_A 2023 0 0 No
# 10 QUEST_B 2015 1 0 No
# 11 QUEST_B 2016 1 0 No
# 12 QUEST_B 2017 1 0 No
# 13 QUEST_B 2018 1 1 No
# 14 QUEST_B 2019 0 0 No
# 15 QUEST_B 2020 0 -1 No
# 16 QUEST_B 2021 1 1 No
# 17 QUEST_B 2022 0 0 No
# 18 QUEST_B 2023 0 0 No