rdatetidyverserunner

Calculating running amounts with uneven number of dates


I have a table full of dates and reply-to times for technical support tickets, and I'd like to calculate a running rate to find the average reply time over the previous n number of days. The data is in the following format.

Dates Reply Time Ticket ID
2024-01-02 341 1
2024-01-02 31 2
2024-01-03 321 3
2024-01-05 412 4
2024-01-07 93 5
2024-01-07 169 6

I can solve this problem by calculating the average reply time each day and then calculating the average reply time over the previous n number of days from that, but this doesn't take into account the number of observations each day, which will skew the results if there are outliers for certain days. I want to account for the number of observations when I'm averaging to prevent outliers from throwing off the data.

Here I'm using the package runner to get the average reply time each day and calculating a moving average from that.

daily_reply_time <- df_replies %>%
  filter(!is.na(reply_time) & !is.na(dates)) %>%
  group_by(dates) %>%
  reframe(avg_reply_time = mean(reply_time, na.rm = TRUE)) %>%
  mutate(
    x = "x",
    dates = lubridate::ymd(dates)
  ) %>%
  filter(!is.na(dates)) %>%
  complete(
    nesting(x),
    dates = seq(min(dates), max(dates), by = "day")
  ) %>%
  group_by(x) %>%
  arrange(dates) %>%
  mutate(
    dates= lubridate::ymd(dates),
    avg_reply_time = ifelse(is.na(avg_reply_time), 0, as.numeric(avg_reply_time )),
    running_reply_time_30_days = runner::mean_run(x = avg_reply_time, k = 30, idx = dates)
  ) %>%
  select(-x)

I create a dummy variable x so that nesting works properly; I assume there's a way to skip over that I'm unaware of. Anyways, this would give me averages of 186, 321, 0, 412, 0, and 131 for each day, so when I use runner, I get a moving average of 175 on 2024-01-08, instead of the 227.83 that is expected when you just sum up all the numbers and divide by the number of observations.

If I skip grouping by each date and instead use the complete function, I get an error saying that "'from' must be a finite number." And not using complete and trying to use the runner package does not throw an error, but averages over the previous n rows in the dataset instead of the number of dates.

daily_reply_time <- df_replies %>%
  filter(created_at > '2023-12-31') %>%
  mutate(
    created_at = substr(created_at, 1, 10),
    first_reply_time_in_minutes = first_reply_time_in_minutes / 60
  ) %>%
  filter(!is.na(created_at) & !is.na(first_reply_time_in_minutes)) %>%
  mutate(x = "x") %>%
  complete( 
    nesting(x),
    created_at = seq(min(created_at), max(created_at), by = "day")
  )

Is there a way to account for the number of observations when calculating running amounts using runner, or through some other package?

EDIT: The expected output would contain one row for each date, and a moving average reply time over the previous n days, meaning the input and output would have different lengths (amounts are not supposed to match up with the table above, just an example of the expected output given a particular dataframe).

Dates Moving Avg. Reply Time
2024-01-02 125
2024-01-03 108.3
2024-01-04 108.3
2024-01-05 137
2024-01-06 67
2024-01-07 251

Solution

  • To do it I would use the roll... functions from the {zoo} package.

    library(dplyr)
    library(zoo)
    

    First let’s generate some data. each day in january will be present at least once.

    set.seed(123)
    january_dates <- seq(as.Date("2024-01-01"), length.out = 31, by = "day")
    duplicate_dates <- sample(
      x = seq(as.Date("2024-01-01"), length.out = 31, by = "day"),
      size = 29,
      replace = TRUE
    )
    
    data <- data.frame(
      ticket_id = 1:60,
      date = c(january_dates, duplicate_dates),
      reply_time = sample(1:300, size = 60, replace = TRUE)
    )
    
    head(data)
    #>   ticket_id       date reply_time
    #> 1         1 2024-01-01        137
    #> 2         2 2024-01-02        254
    #> 3         3 2024-01-03        211
    #> 4         4 2024-01-04         78
    #> 5         5 2024-01-05         81
    #> 6         6 2024-01-06         43
    

    Now let’s calculate the total_reply_time and the number_of_tickets by day.

    summary <- data |>
      arrange(date) |>
      summarise(
        total_reply_time = sum(reply_time),
        number_of_tickets = n(),
        .by = date
      )
    

    The last step is to get the weighted rolling average

    summary |>
      mutate(
        rollsum_reply_time = zoo::rollsum(total_reply_time, k = 7, fill = NA, align = "right"),
        rollsum_tickerts = zoo::rollsum(number_of_tickets, k = 7, fill = NA, align = "right"),
        rolling_average = rollsum_reply_time / rollsum_tickerts
      )
    #>          date total_reply_time number_of_tickets rollsum_reply_time
    #> 1  2024-01-01              137                 1                 NA
    #> 2  2024-01-02              254                 1                 NA
    #> 3  2024-01-03              521                 3                 NA
    #> 4  2024-01-04               78                 1                 NA
    #> 5  2024-01-05              380                 3                 NA
    #> 6  2024-01-06               43                 1                 NA
    #> 7  2024-01-07              279                 2               1692
    #> 8  2024-01-08              117                 2               1672
    #> 9  2024-01-09              308                 2               1726
    #> 10 2024-01-10              554                 3               1759
    #> 11 2024-01-11               27                 2               1708
    #> 12 2024-01-12              135                 1               1463
    #> 13 2024-01-13              224                 1               1644
    #> 14 2024-01-14              448                 3               1813
    #> 15 2024-01-15              452                 2               2148
    #> 16 2024-01-16              290                 1               2130
    #> 17 2024-01-17               69                 1               1645
    #> 18 2024-01-18              281                 2               1899
    #> 19 2024-01-19              321                 3               2085
    #> 20 2024-01-20              132                 2               1993
    #> 21 2024-01-21              141                 1               1686
    #> 22 2024-01-22              522                 3               1756
    #> 23 2024-01-23              153                 1               1619
    #> 24 2024-01-24              294                 1               1844
    #> 25 2024-01-25              540                 4               2103
    #> 26 2024-01-26              231                 3               2013
    #> 27 2024-01-27              502                 3               2383
    #> 28 2024-01-28              381                 2               2623
    #> 29 2024-01-29               83                 2               2184
    #> 30 2024-01-30              116                 1               2147
    #> 31 2024-01-31              356                 2               2209
    #>    rollsum_tickerts rolling_average
    #> 1                NA              NA
    #> 2                NA              NA
    #> 3                NA              NA
    #> 4                NA              NA
    #> 5                NA              NA
    #> 6                NA              NA
    #> 7                12        141.0000
    #> 8                13        128.6154
    #> 9                14        123.2857
    #> 10               14        125.6429
    #> 11               15        113.8667
    #> 12               13        112.5385
    #> 13               13        126.4615
    #> 14               14        129.5000
    #> 15               14        153.4286
    #> 16               13        163.8462
    #> 17               11        149.5455
    #> 18               11        172.6364
    #> 19               13        160.3846
    #> 20               14        142.3571
    #> 21               12        140.5000
    #> 22               13        135.0769
    #> 23               13        124.5385
    #> 24               13        141.8462
    #> 25               15        140.2000
    #> 26               15        134.2000
    #> 27               16        148.9375
    #> 28               17        154.2941
    #> 29               16        136.5000
    #> 30               16        134.1875
    #> 31               17        129.9412
    

    Created on 2024-03-26 with reprex v2.0.2