I have data that looks like the following:
df <- tibble(
date= seq.Date(as.Date("2021-01-01"), as.Date("2022-02-01"), by = "month"),
val1 = c(100, 100, 105, 125, 125, 125, 125, 132, 132, 132, 135, 150, 150, 150),
val2 = c(100, 100, 100, 125, 125, 125, 125, 125, 125, 125, 125, 150, 150, 150),
diff = val1-val2)
date val1 val2 diff
<date> <dbl> <dbl> <dbl>
1 2021-01-01 100 100 0
2 2021-02-01 100 100 0
3 2021-03-01 105 100 5
4 2021-04-01 125 125 0
5 2021-05-01 125 125 0
6 2021-06-01 125 125 0
7 2021-07-01 125 125 0
8 2021-08-01 132 125 7
9 2021-09-01 132 125 7
10 2021-10-01 132 125 7
11 2021-11-01 135 125 10
12 2021-12-01 150 150 0
13 2022-01-01 150 150 0
14 2022-02-01 150 150 0
I am trying to produce the following output:
output <- tibble(
date= seq.Date(as.Date("2021-01-01"), as.Date("2022-02-01"), by = "month"),
val1 = c(100, 100, 105, 125, 125, 125, 125, 132, 132, 132, 135, 150, 150, 150),
val2 = c(100, 100, 100, 125, 125, 125, 125, 125, 125, 125, 125, 150, 150, 150),
diff = val1-val2,
diff_calc = c(0, 0, 0, 5, 5, 5, 5, 5, 5, 5, 5, 15, 15, 15))
date val1 val2 diff diff_calc
<date> <dbl> <dbl> <dbl> <dbl>
1 2021-01-01 100 100 0 0
2 2021-02-01 100 100 0 0
3 2021-03-01 105 100 5 0
4 2021-04-01 125 125 0 5
5 2021-05-01 125 125 0 5
6 2021-06-01 125 125 0 5
7 2021-07-01 125 125 0 5
8 2021-08-01 132 125 7 5
9 2021-09-01 132 125 7 5
10 2021-10-01 132 125 7 5
11 2021-11-01 135 125 10 5
12 2021-12-01 150 150 0 15
13 2022-01-01 150 150 0 15
14 2022-02-01 150 150 0 15
Where diff_calc
is the cumulative sum of the previous unique values in diff
, unless multiple unique diff
values occur consecutively, then it should be the largest value plus the cumulative sum of any of the previous diff
values using the same logic.
This is a derivative of this question I asked previously, but I realized I didn't provide the best example or description to accommodate what I need here, so posting as a new question. Thanks!
Two (similar) options:
Option 1: Largest value in diff not necessarily the last value
library(dplyr)
output |>
group_by(val2) |>
mutate(tmp = max(diff),
tmp = replace(tmp, 1:n()-1, 0)) |>
ungroup() |>
mutate(diff_calc = lag(cumsum(tmp), default = 0)) |>
select(-tmp)
# # A tibble: 14 × 5
# date val1 val2 diff diff_calc
# <date> <dbl> <dbl> <dbl> <dbl>
# 1 2021-01-01 100 100 0 0
# 2 2021-02-01 100 100 0 0
# 3 2021-03-01 105 100 5 0
# 4 2021-04-01 125 125 0 5
# 5 2021-05-01 125 125 0 5
# 6 2021-06-01 125 125 0 5
# 7 2021-07-01 125 125 0 5
# 8 2021-08-01 132 125 7 5
# 9 2021-09-01 132 125 7 5
# 10 2021-10-01 132 125 7 5
# 11 2021-11-01 135 125 10 5
# 12 2021-12-01 150 150 0 15
# 13 2022-01-01 150 150 0 15
# 14 2022-02-01 150 150 0 15
Option 2: Last value in diff per var2 group always largest
output |>
group_by(val2) |>
mutate(tmp = last(diff),
tmp = replace(tmp, 1:n()-1, 0)) |>
ungroup() |>
mutate(diff_calc = lag(cumsum(tmp), default = 0)) |>
select(-tmp)