I'm using dbplyr to access data in Redshift tables and Tidyverse to do the data wrangling. I'm trying to calculate a rolling sum amount over 7 days.
Unfortunately, none of the packages that I've found suggested on stack overflow, such as 'slider' and 'runner', seem to be compatible with dbplyr (or with the Redshift tables and sql that I'm using).
I'm trying to achieve similar results to this table. In this example, the value in the cum_sum_7Days
column is the sum of values in the amount
column where the start_date
fits between the start_date
and previous_7Day
columns.
Acct Start_Date Previous_7Day Amount Cum_sum_7Days
YYYY 8/07/2022 7:04 1/07/2022 7:04 500 500
YYYY 8/07/2022 12:49 1/07/2022 12:49 200 700
YYYY 9/07/2022 11:47 2/07/2022 11:47 300 1000
YYYY 9/07/2022 11:52 2/07/2022 11:52 45.6 1045.6
YYYY 12/07/2022 13:03 5/07/2022 13:03 200 1245.6
YYYY 15/07/2022 13:53 8/07/2022 13:53 200 745.6
YYYY 16/07/2022 12:58 9/07/2022 12:58 300 700
YYYY 16/07/2022 13:28 9/07/2022 13:28 500 1200
YYYY 19/07/2022 12:22 12/07/2022 12:22 200 1400
YYYY 23/07/2022 5:52 16/07/2022 5:52 200 1200
YYYY 26/07/2022 13:01 19/07/2022 13:01 100 300
YYYY 29/07/2022 13:50 22/07/2022 13:50 200 500
YYYY 30/07/2022 13:57 23/07/2022 13:57 300 600
YYYY 3/08/2022 6:17 27/07/2022 6:17 200 700
YYYY 5/08/2022 13:30 29/07/2022 13:30 200 900
YYYY 9/08/2022 13:44 2/08/2022 13:44 200 600
YYYY 12/08/2022 12:13 5/08/2022 12:13 200 600
Note that:
So, any solution need to handle this.
Yes, this should be possible with dbplyr. But the method will look different that via pure R.
dbplyr works by translating tidyverse commands into SQL. If no translation is defined for a function then it is unlikely to work correctly. This is why using functions like sum_run
from the runner
package give a no applicable method
error. dbplyr does not know how to translate them into SQL.
The method we can use follows similar logic to how we would approach this problem in SQL:
output = tbl1 %>%
inner_join(tbl1, by = "Acct", suffix = c(".x", ".y")) %>%
filter(
Previous_7Day.x <= Start_Date.y,
Start_Date.y <= Start_Date.x
) %>%
group_by(Acct.x, Start_Date.x, Previous_7Day.x, Amount.x) %>%
summarise(Cum_sum_7Days = sum(Amount.y) %>%
select(
Acct = Acct.x,
Start_Date = Start_Date.x,
Previous_7Day = Previous_7Day.x,
Amount = Amount.x,
Cum_sum_7Days
)
The core of the idea is to join the table to itself. After filtering we have all combinations where the .y
date is between the .x
dates. So we can sum all the .y
amounts to produce the rolling sum.