r dplyr long-format-data

Calculating a new variable with input from multiple groups in long format


I was wondering whether the following calculation is possible using dplyr without transforming my data into wide format. My data looks like the following:

data <- data.frame(ID = c(rep(1:2, 6)),
                   Date = c(rep(as.Date('2022-03-01'), 4), rep(as.Date('2022-03-02'), 4), rep(as.Date('2022-03-03'), 4)),
                   Type = rep(LETTERS[c(1,1,2,2)], 3),
                   Value = c(1,2,101,102,3,4,103,104,5,6,105,106))

My goal is a calculation that involves the value of type B on a given day as well as the values of both type A and type B on the previous day. If the calculation only involved one group, dplyr::lag would be the way to go, but I don't see how to apply it in this case. I'd like to avoid pivoting my data into wide format.

As an example, I'd like to calculate X = B(t) - A(t-1) * B(t-1) for each ID, where t denotes the date. My goal in this case would be something like the following data frame:

data_goal <- data.frame(ID = c(rep(1:2, 3)),
                        Date = c(rep(as.Date('2022-03-01'), 2), rep(as.Date('2022-03-02'), 2), rep(as.Date('2022-03-03'), 2)),
                        X = c(NA, NA, 103 - 1 * 101, 104 - 2 * 102, 105 - 3 * 103, 106 - 4 * 104))

If I only needed the daily difference within each type, my solution would be:

data |>
  dplyr::arrange(Date) |>
  dplyr::group_by(ID, Type) |>
  dplyr::mutate(Diff = Value - dplyr::lag(Value, n = 1))

But unfortunately I have no idea how I might extend this.

Any help is highly appreciated!

Thanks a lot!

I'd also be glad to know if this is not possible. In that case I would transform the table into wide format and continue from there. My actual data has many more types, which is why I'd like to avoid that.


Solution

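  • This approach may be useful: it assigns an index to each date, then for each index subtracts the product of the previous date's A and B values from the current date's B values.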

    data <- data.frame(
      ID = c(rep(1:2, 6)),
      Date = c(rep(as.Date('2022-03-01'), 4), rep(as.Date('2022-03-02'), 4), rep(as.Date('2022-03-03'), 4)),
      Type = rep(LETTERS[c(1, 1, 2, 2)], 3),
      Value = c(1, 2, 101, 102, 3, 4, 103, 104, 5, 6, 105, 106)
    )
    
    library(tidyverse)
    
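    data %>%
      # Index the dates: grp is 1 for the first date, 2 for the second, ...
      group_by(Date) %>%
      mutate(grp = cur_group_id()) %>%
      ungroup() %>%
      # For each date index, compute B(t) - A(t-1) * B(t-1) across all IDs
      summarise(Diff = map(
        .x = seq(max(grp)),
        .f = ~ Value[Type == "B" & grp == .x] -
          Value[Type == "A" & grp == .x - 1] * Value[Type == "B" & grp == .x - 1]
      )) %>%
      # The first date has no predecessor, so its element is empty and unnest() drops it
      unnest(Diff) %>%
      # Add the NA rows for the first date back, one per ID
      add_case(Diff = rep(NA, length(unique(data$ID))), .before = 1) %>%
      # Attach the ID/Date pairs; relies on rows being ordered by Date, then ID
      add_column(distinct(data, ID, Date), .before = 1)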
    #> # A tibble: 6 × 3
    #>      ID Date        Diff
    #>   <int> <date>     <dbl>
    #> 1     1 2022-03-01    NA
    #> 2     2 2022-03-01    NA
    #> 3     1 2022-03-02     2
    #> 4     2 2022-03-02  -100
    #> 5     1 2022-03-03  -204
    #> 6     2 2022-03-03  -310
    

    Created on 2022-04-26 by the reprex package (v2.0.1)
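
As a possible alternative, here is a minimal sketch that collapses only the two types the formula needs into one row per ID and Date, then applies dplyr::lag within each ID. It assumes exactly one A value and one B value per ID and Date. The column names A, B, and X and the object name result are chosen here for illustration and do not come from the original post. This is a narrow two-column reshape rather than a full pivot, so any other types in the data are simply ignored.

```r
library(dplyr)

data <- data.frame(
  ID = c(rep(1:2, 6)),
  Date = c(rep(as.Date('2022-03-01'), 4), rep(as.Date('2022-03-02'), 4), rep(as.Date('2022-03-03'), 4)),
  Type = rep(LETTERS[c(1, 1, 2, 2)], 3),
  Value = c(1, 2, 101, 102, 3, 4, 103, 104, 5, 6, 105, 106)
)

result <- data |>
  group_by(ID, Date) |>
  # Assumption: exactly one A and one B row per ID and Date
  summarise(
    A = Value[Type == "A"],
    B = Value[Type == "B"],
    .groups = "drop_last"  # keep grouping by ID so lag() stays within each ID
  ) |>
  # summarise() returns rows sorted by ID and Date, so lag() gives the previous date
  mutate(X = B - lag(A) * lag(B)) |>
  ungroup() |>
  select(ID, Date, X)

result
```

The result is ordered by ID and then Date, so the rows appear in a different order than in data_goal, but the X values for each ID and Date pair should match.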