I have a table of dates and reply times for technical support tickets, and I'd like to calculate a running rate: the average reply time over the previous n days. The data is in the following format.
Dates | Reply Time | Ticket ID |
---|---|---|
2024-01-02 | 341 | 1 |
2024-01-02 | 31 | 2 |
2024-01-03 | 321 | 3 |
2024-01-05 | 412 | 4 |
2024-01-07 | 93 | 5 |
2024-01-07 | 169 | 6 |
I can solve this by calculating the average reply time for each day and then averaging those daily averages over the previous n days, but this doesn't take the number of observations per day into account, which skews the results when some days have outliers. I want to weight by the number of observations when averaging so that outliers don't throw off the data.
Here I'm using the {runner} package to get the average reply time each day and then calculating a moving average from that:
daily_reply_time <- df_replies %>%
  filter(!is.na(reply_time) & !is.na(dates)) %>%
  group_by(dates) %>%
  reframe(avg_reply_time = mean(reply_time, na.rm = TRUE)) %>%
  mutate(
    x = "x",
    dates = lubridate::ymd(dates)
  ) %>%
  filter(!is.na(dates)) %>%
  complete(
    nesting(x),
    dates = seq(min(dates), max(dates), by = "day")
  ) %>%
  group_by(x) %>%
  arrange(dates) %>%
  mutate(
    dates = lubridate::ymd(dates),
    avg_reply_time = ifelse(is.na(avg_reply_time), 0, as.numeric(avg_reply_time)),
    running_reply_time_30_days = runner::mean_run(x = avg_reply_time, k = 30, idx = dates)
  ) %>%
  select(-x)
I create a dummy variable x so that nesting() works; I assume there's a way to skip that step that I'm unaware of. Anyway, this gives me daily averages of 186, 321, 0, 412, 0, and 131, so when I use {runner} I get a moving average of 175 on 2024-01-07, instead of the 227.83 that's expected when you just sum up all the reply times and divide by the number of observations.
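To make the discrepancy concrete, here is the same comparison done by hand on the example table above:

reply_times <- c(341, 31, 321, 412, 93, 169)  # all six tickets
daily_avgs  <- c(186, 321, 0, 412, 0, 131)    # per-day averages, zero-filled for empty days

mean(daily_avgs)   # 175      <- average of daily averages (what the pipeline above produces)
mean(reply_times)  # 227.8333 <- sum of all reply times / number of tickets (what I want)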
If I skip grouping by each date and instead use the complete() function directly, I get an error saying "'from' must be a finite number" (see the note after the code below). And not using complete() at all and just calling the {runner} function doesn't throw an error, but it averages over the previous n rows in the dataset instead of the previous n days.
daily_reply_time <- df_replies %>%
  filter(created_at > '2023-12-31') %>%
  mutate(
    created_at = substr(created_at, 1, 10),
    first_reply_time_in_minutes = first_reply_time_in_minutes / 60
  ) %>%
  filter(!is.na(created_at) & !is.na(first_reply_time_in_minutes)) %>%
  mutate(x = "x") %>%
  complete(
    nesting(x),
    created_at = seq(min(created_at), max(created_at), by = "day")
  )
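For what it's worth, the "'from' must be a finite number" error comes from created_at still being a character vector after substr(), so seq() can't build a date sequence from it. Converting to Date first should avoid it, and with a single column to complete the dummy x/nesting() isn't needed; a sketch (assuming created_at is an ISO-formatted timestamp):

df_replies %>%
  mutate(created_at = as.Date(substr(created_at, 1, 10))) %>%  # Date, not character
  complete(created_at = seq(min(created_at), max(created_at), by = "day"))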
Is there a way to account for the number of observations when calculating running amounts with {runner}, or through some other package?
EDIT: The expected output would contain one row for each date and the moving average reply time over the previous n days, meaning the input and output would have different numbers of rows (the amounts below aren't supposed to match up with the table above; it's just an example of the expected output for some dataframe).
Dates | Moving Avg. Reply Time |
---|---|
2024-01-02 | 125 |
2024-01-03 | 108.3 |
2024-01-04 | 108.3 |
2024-01-05 | 137 |
2024-01-06 | 67 |
2024-01-07 | 251 |
To do this I would use the roll*() functions from the {zoo} package.
library(dplyr)
library(zoo)
First let’s generate some data. Each day in January will be present at least once.
set.seed(123)
january_dates <- seq(as.Date("2024-01-01"), length.out = 31, by = "day")
duplicate_dates <- sample(
  x = seq(as.Date("2024-01-01"), length.out = 31, by = "day"),
  size = 29,
  replace = TRUE
)
data <- data.frame(
  ticket_id = 1:60,
  date = c(january_dates, duplicate_dates),
  reply_time = sample(1:300, size = 60, replace = TRUE)
)
head(data)
#> ticket_id date reply_time
#> 1 1 2024-01-01 137
#> 2 2 2024-01-02 254
#> 3 3 2024-01-03 211
#> 4 4 2024-01-04 78
#> 5 5 2024-01-05 81
#> 6 6 2024-01-06 43
Now let’s calculate the total_reply_time and the number_of_tickets by day.
summary <- data |>
  arrange(date) |>
  summarise(
    total_reply_time = sum(reply_time),
    number_of_tickets = n(),
    .by = date
  )
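Keeping both the per-day totals and the per-day ticket counts is what makes the rolling average properly weighted: the window average is the sum of all reply times in the window divided by the number of tickets in the window, not an average of daily averages. A quick sanity check against the question's sample data, for a window covering 2024-01-02 through 2024-01-07:

sum(c(372, 321, 412, 262)) / sum(c(2, 1, 1, 2))
#> [1] 227.8333

which matches the 227.83 the question expects.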
The last step is to get the weighted rolling average:
summary |>
  mutate(
    rollsum_reply_time = zoo::rollsum(total_reply_time, k = 7, fill = NA, align = "right"),
    rollsum_tickets = zoo::rollsum(number_of_tickets, k = 7, fill = NA, align = "right"),
    rolling_average = rollsum_reply_time / rollsum_tickets
  )
#> date total_reply_time number_of_tickets rollsum_reply_time
#> 1 2024-01-01 137 1 NA
#> 2 2024-01-02 254 1 NA
#> 3 2024-01-03 521 3 NA
#> 4 2024-01-04 78 1 NA
#> 5 2024-01-05 380 3 NA
#> 6 2024-01-06 43 1 NA
#> 7 2024-01-07 279 2 1692
#> 8 2024-01-08 117 2 1672
#> 9 2024-01-09 308 2 1726
#> 10 2024-01-10 554 3 1759
#> 11 2024-01-11 27 2 1708
#> 12 2024-01-12 135 1 1463
#> 13 2024-01-13 224 1 1644
#> 14 2024-01-14 448 3 1813
#> 15 2024-01-15 452 2 2148
#> 16 2024-01-16 290 1 2130
#> 17 2024-01-17 69 1 1645
#> 18 2024-01-18 281 2 1899
#> 19 2024-01-19 321 3 2085
#> 20 2024-01-20 132 2 1993
#> 21 2024-01-21 141 1 1686
#> 22 2024-01-22 522 3 1756
#> 23 2024-01-23 153 1 1619
#> 24 2024-01-24 294 1 1844
#> 25 2024-01-25 540 4 2103
#> 26 2024-01-26 231 3 2013
#> 27 2024-01-27 502 3 2383
#> 28 2024-01-28 381 2 2623
#> 29 2024-01-29 83 2 2184
#> 30 2024-01-30 116 1 2147
#> 31 2024-01-31 356 2 2209
#>    rollsum_tickets rolling_average
#> 1 NA NA
#> 2 NA NA
#> 3 NA NA
#> 4 NA NA
#> 5 NA NA
#> 6 NA NA
#> 7 12 141.0000
#> 8 13 128.6154
#> 9 14 123.2857
#> 10 14 125.6429
#> 11 15 113.8667
#> 12 13 112.5385
#> 13 13 126.4615
#> 14 14 129.5000
#> 15 14 153.4286
#> 16 13 163.8462
#> 17 11 149.5455
#> 18 11 172.6364
#> 19 13 160.3846
#> 20 14 142.3571
#> 21 12 140.5000
#> 22 13 135.0769
#> 23 13 124.5385
#> 24 13 141.8462
#> 25 15 140.2000
#> 26 15 134.2000
#> 27 16 148.9375
#> 28 17 154.2941
#> 29 16 136.5000
#> 30 16 134.1875
#> 31 17 129.9412
Created on 2024-03-26 with reprex v2.0.2
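One caveat: zoo::rollsum() with k = 7 counts rows, so the window is 7 calendar days only because every date shows up in summary. If some days can have no tickets at all (as in the question's data), you could either complete() the missing dates with zero totals and zero counts before calling rollsum(), or index the windows by date with {runner}, which the question was already using. A sketch of the latter, under the assumption that sum_run() with idx set to a Date column defines the window in days rather than rows:

summary |>
  mutate(
    rolling_average = runner::sum_run(total_reply_time, k = 7, idx = date) /
      runner::sum_run(number_of_tickets, k = 7, idx = date)
  )

Days with no tickets contribute zero to both sums, so skipping them doesn't change the window totals; runner also returns partial-window averages for the first few days instead of the leading NAs you get from fill = NA.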