rdatelagcumsum

lagged cumsum in R


I have data that looks like the following:

df <- tibble(
      date= seq.Date(as.Date("2021-01-01"), as.Date("2022-02-01"), by = "month"),
      val1 = c(100, 100, 105, 125, 125, 125, 125, 132, 132, 132, 135, 150, 150, 150),
      val2 = c(100, 100, 100, 125, 125, 125, 125, 125, 125, 125, 125, 150, 150, 150),
      diff = val1-val2)

       date        val1  val2  diff
       <date>     <dbl> <dbl> <dbl>
     1 2021-01-01   100   100     0
     2 2021-02-01   100   100     0
     3 2021-03-01   105   100     5
     4 2021-04-01   125   125     0
     5 2021-05-01   125   125     0
     6 2021-06-01   125   125     0
     7 2021-07-01   125   125     0
     8 2021-08-01   132   125     7
     9 2021-09-01   132   125     7
    10 2021-10-01   132   125     7
    11 2021-11-01   135   125    10
    12 2021-12-01   150   150     0
    13 2022-01-01   150   150     0
    14 2022-02-01   150   150     0

I am trying to produce the following output:

output <- tibble(
  date= seq.Date(as.Date("2021-01-01"), as.Date("2022-02-01"), by = "month"),
  val1 = c(100, 100, 105, 125, 125, 125, 125, 132, 132, 132, 135, 150, 150, 150),
  val2 = c(100, 100, 100, 125, 125, 125, 125, 125, 125, 125, 125, 150, 150, 150),
  diff = val1-val2,
  diff_calc = c(0, 0, 0, 5, 5, 5, 5, 5, 5, 5, 5, 15, 15, 15))

 date        val1  val2  diff diff_calc
 <date>     <dbl> <dbl> <dbl>     <dbl>
 1 2021-01-01   100   100     0         0
 2 2021-02-01   100   100     0         0
 3 2021-03-01   105   100     5         0
 4 2021-04-01   125   125     0         5
 5 2021-05-01   125   125     0         5
 6 2021-06-01   125   125     0         5
 7 2021-07-01   125   125     0         5
 8 2021-08-01   132   125     7         5
 9 2021-09-01   132   125     7         5
10 2021-10-01   132   125     7         5
11 2021-11-01   135   125    10         5
12 2021-12-01   150   150     0        15
13 2022-01-01   150   150     0        15
14 2022-02-01   150   150     0        15

Where diff_calc is the cumulative sum of the previous unique values in diff, unless multiple unique diff values occur consecutively, then it should be the largest value plus the cumulative sum of any of the previous diff values using the same logic.

This is a derivative of this question I asked previously, but I realized I didn't provide the best example or description to accommodate what I need here, so posting as a new question. Thanks!


Solution

  • Two (similar) options:

    Option 1: Largest value in diff not necessarily the last value

    library(dplyr)
    
    output |>
      group_by(val2) |>
      mutate(tmp = max(diff),
             tmp = replace(tmp, 1:n()-1, 0)) |>
      ungroup() |>
      mutate(diff_calc = lag(cumsum(tmp), default = 0)) |>
      select(-tmp)
     
    # # A tibble: 14 × 5
    #    date        val1  val2  diff diff_calc
    #    <date>     <dbl> <dbl> <dbl>     <dbl>
    #  1 2021-01-01   100   100     0         0
    #  2 2021-02-01   100   100     0         0
    #  3 2021-03-01   105   100     5         0
    #  4 2021-04-01   125   125     0         5
    #  5 2021-05-01   125   125     0         5
    #  6 2021-06-01   125   125     0         5
    #  7 2021-07-01   125   125     0         5
    #  8 2021-08-01   132   125     7         5
    #  9 2021-09-01   132   125     7         5
    # 10 2021-10-01   132   125     7         5
    # 11 2021-11-01   135   125    10         5
    # 12 2021-12-01   150   150     0        15
    # 13 2022-01-01   150   150     0        15
    # 14 2022-02-01   150   150     0        15
    

    Option 2: Last value in diff per var2 group always largest

    output |>
      group_by(val2) |>
      mutate(tmp = last(diff),
             tmp = replace(tmp, 1:n()-1, 0)) |>
      ungroup() |>
      mutate(diff_calc = lag(cumsum(tmp), default = 0)) |>
      select(-tmp)