r datetime count data-wrangling multiple-users

Count rows (conditionally) within specified time period by group in R

I am working with user generated data and I want to count the number of rows/activities i.e. Calls by each user made within a specific time period. Here is a mock dataframe, similar to what I am working with:

library(ids)#for generating the UserID variable
library(wakefield)#for generating the Status variable
library(dplyr)

set.seed(123)
UserID<-random_id(n=10, bytes = 5)
DateTime<-seq.POSIXt(from = as.POSIXct("2020-08-01 01:00:00", tz = Sys.timezone()), length.out = 70, by = "15 mins")
df<-cbind(UserID,DateTime)
df<-as.data.frame(df)
df$Status<-r_sample_factor(x = c("Answered", "Abandoned", "Engaged"), n=70)
df$DateTime<-seq.POSIXt(from = as.POSIXct("2020-08-01 01:00:00", tz = Sys.timezone()), 
                        length.out = 70, by = "15 mins")#re-doing this again as it annoyingly converts to numeric each time 

df<-df%>%arrange(UserID,DateTime)
head(df)


      #UserID            DateTime    Status
#1 0a5f3a2a8b 2020-08-01 02:00:00   Engaged
#2 0a5f3a2a8b 2020-08-01 04:30:00   Engaged
#3 0a5f3a2a8b 2020-08-01 07:00:00   Engaged
#4 0a5f3a2a8b 2020-08-01 09:30:00   Engaged
#5 0a5f3a2a8b 2020-08-01 12:00:00   Engaged
#6 0a5f3a2a8b 2020-08-01 14:30:00 Abandoned

What I want to do is count the number of Calls by UserID within a 5 hour period, with two other conditions:-

If there isn't another Call within the 5 hour period of the last Call that the User made, then this goes down as a single "attempt"
If the user has N Calls within a 5-hour period until they get "Answered", then that counts as an "successful" attempt. Otherwise, it goes down as an "unsuccessful"

Here is what I am trying to achieve:-

UserId          OrigTime       LastTime          Calls  Status       Successful
0a5f3a2a8b  2020-08-01 02:00:00 2020-08-01 07:00:00 3   Engaged          No
16db61d2bc  2020-08-01 03:15:00 2020-08-01 03:15:00 1   Answered         Yes
6355f7700d  2020-08-01 01:00:00 2020-08-01 06:00:00 3   Answered         Yes
9b9fab9789  2020-08-01 04:15:00 2020-08-01 09:15:00 3   Answered         Yes
...

So OrigTime is the time of their first call within a single attempt, and LastTime is the time of their last call within the same single attempt. The Calls column counts the number of calls the User made within that attempt, Status is the status of the last call within an attempt and "Successful" can be logical, saying whether the last call within that attempt was answered or not.

Any pointers in the right direction would be great. I imagine that there is some data.table or dplyr solution but I haven't done much of this kind of activity before, so not sure where to begin. Big thank you in advance :)

EDIT

@Waldi provided a solution which provided nearly what I needed. Here is the solution that works best so far (slightly amended from @Waldi 's answer):-

CondCount <- function(data,maxdelay){
  result <- list()
  row <- 0
  calls <- 0
  OrigTime <- NA
  n <- nrow(data)
  
  for (i in 1:n) {
    if (is.na(OrigTime)) {
      OrigTime <- data$DateTime[[i]]
      calls <- 0
    }
    calls = calls + 1
    if (data$Status[[i]] == "Answered" | difftime(data$DateTime[[i]],OrigTime,units='hours') > maxdelay | i==n) {
      row <- row + 1
      result[[row]] <- data.frame(OrigTime = OrigTime, LastTime = data$DateTime[[i]], calls = calls, Status = factor(data$Status[[i]],levels=c("Answered" ,"Abandoned" ,"Engaged","Unknown")), Successful = ifelse(data$Status[[i]]=="Answered",'Y','N')  )
      OrigTime <- NA
    }
  }
  dplyr::bind_rows(result)
}

df %>% arrange(UserID,DateTime) %>%
       split(.$UserID) %>%
       map(function(data) {CondCount(data,1) }) %>%
       bind_rows(.id="UserID")

Please see the 2 steps that I had written prior to my edit. This time around, the period is 1 hour and not 5 hour.

With @Waldi 's solution, here is when it works on my real df (I apologise for the colour coding that I have used, if by chance there are any colour-blind SO users):-

Correct outcome

and with @Waldi 's solution, it will give you this:-

which is correct! This is what I am aiming for. However, I want to illustrate two examples of what happens when I run this code, which gives the undesired outcome:-

Incorrect Outcome 1

which gives you this:-

This is incorrect. It should be two rows with one for each attempt (each with a final status of "Abandoned") and not one row, as the time difference between the last two rows is greater than 60 mins.

Incorrect Outcome 2

which gives you this:-

This is incorrect. It should be two rows with one for each attempt (first row with a status of "Engaged", second row with a status of "Answered").

I have to give massive credit to @Waldi because the solution works perfectly for answered calls. However, it does not take into consideration the other status types i.e. Abandoned and Engaged. It may be a case of not enough conditions being met for these two statuses. As always, any help would be appreciated!

Solution

You could use purrr to split data by user, and use a simple for-loop function to implement the logic you're looking for :

library(purrr)

CondCount <- function(data,maxdelay){
  result <- list()
  row <- 0
  calls <- 0
  OrigTime <- NA
  n <- nrow(data)
  
  for (i in 1:n) {
    if (is.na(OrigTime)) {
      OrigTime <- data$DateTime[[i]]
      calls <- 0
    }
    calls = calls + 1
    if (difftime(data$DateTime[[i]],OrigTime,units='hours') > maxdelay) {
      row <- row + 1
      result[[row]] <- data.frame(OrigTime = OrigTime, LastTime = data$DateTime[[i-1]], calls = calls, Status = factor(data$Status[[i-1]],levels=c("Answered" ,"Abandoned" ,"Engaged")), Successful = ifelse(data$Status[[i]]=="Answered",'Y','N')  )
      OrigTime <- data$DateTime[[i]]
    } 
    if ((data$Status[[i]] !="Engaged") | i == n) {
      row <- row + 1
      result[[row]] <- data.frame(OrigTime = OrigTime, LastTime = data$DateTime[[i]], calls = calls, Status = factor(data$Status[[i]],levels=c("Answered" ,"Abandoned" ,"Engaged")), Successful = ifelse(data$Status[[i]]=="Answered",'Y','N')  )
      OrigTime <- NA
    }
  } 
  dplyr::bind_rows(result)
}



df %>% arrange(UserID,DateTime) %>%
  split(.$UserID) %>%
  map(function(data) {CondCount(data,5) }) %>%
  bind_rows(.id="UserID")

       UserID            OrigTime            LastTime calls    Status Successful
1  022098d3cf 2020-08-01 03:15:00 2020-08-01 03:15:00     1  Answered          Y
2  022098d3cf 2020-08-01 05:45:00 2020-08-01 05:45:00     1  Answered          Y
3  022098d3cf 2020-08-01 08:15:00 2020-08-01 08:15:00     1 Abandoned          N
4  022098d3cf 2020-08-01 10:45:00 2020-08-01 10:45:00     1  Answered          Y
5  022098d3cf 2020-08-01 13:15:00 2020-08-01 13:15:00     1 Abandoned          N
6  022098d3cf 2020-08-01 15:45:00 2020-08-01 15:45:00     1 Abandoned          N
7  022098d3cf 2020-08-01 18:15:00 2020-08-01 18:15:00     1 Abandoned          N
8  18f13c3972 2020-08-01 01:15:00 2020-08-01 03:45:00     2 Abandoned          N
9  18f13c3972 2020-08-01 06:15:00 2020-08-01 06:15:00     1  Answered          Y
10 18f13c3972 2020-08-01 08:45:00 2020-08-01 13:45:00     3  Answered          Y

If the loop needs to be very fast, it can easily be converted to Rcpp.

NB : for some reason, set.seed(123) doesn't seem to be enough to create reproducible results.