I am working with user generated data and I want to count the number of rows/activities i.e. Calls by each user made within a specific time period. Here is a mock dataframe, similar to what I am working with:
library(ids)#for generating the UserID variable
library(wakefield)#for generating the Status variable
library(dplyr)
set.seed(123)
UserID<-random_id(n=10, bytes = 5)
DateTime<-seq.POSIXt(from = as.POSIXct("2020-08-01 01:00:00", tz = Sys.timezone()), length.out = 70, by = "15 mins")
df<-cbind(UserID,DateTime)
df<-as.data.frame(df)
df$Status<-r_sample_factor(x = c("Answered", "Abandoned", "Engaged"), n=70)
df$DateTime<-seq.POSIXt(from = as.POSIXct("2020-08-01 01:00:00", tz = Sys.timezone()),
length.out = 70, by = "15 mins")#re-doing this again as it annoyingly converts to numeric each time
df<-df%>%arrange(UserID,DateTime)
head(df)
#UserID DateTime Status
#1 0a5f3a2a8b 2020-08-01 02:00:00 Engaged
#2 0a5f3a2a8b 2020-08-01 04:30:00 Engaged
#3 0a5f3a2a8b 2020-08-01 07:00:00 Engaged
#4 0a5f3a2a8b 2020-08-01 09:30:00 Engaged
#5 0a5f3a2a8b 2020-08-01 12:00:00 Engaged
#6 0a5f3a2a8b 2020-08-01 14:30:00 Abandoned
What I want to do is count the number of Calls by UserID
within a 5 hour period, with two other conditions:-
Here is what I am trying to achieve:-
UserId OrigTime LastTime Calls Status Successful
0a5f3a2a8b 2020-08-01 02:00:00 2020-08-01 07:00:00 3 Engaged No
16db61d2bc 2020-08-01 03:15:00 2020-08-01 03:15:00 1 Answered Yes
6355f7700d 2020-08-01 01:00:00 2020-08-01 06:00:00 3 Answered Yes
9b9fab9789 2020-08-01 04:15:00 2020-08-01 09:15:00 3 Answered Yes
...
So OrigTime
is the time of their first call within a single attempt, and LastTime
is the time of their last call within the same single attempt. The Calls
column counts the number of calls the User made within that attempt, Status
is the status of the last call within an attempt and "Successful" can be logical, saying whether the last call within that attempt was answered or not.
Any pointers in the right direction would be great. I imagine that there is some data.table
or dplyr
solution but I haven't done much of this kind of activity before, so not sure where to begin. Big thank you in advance :)
EDIT
@Waldi provided a solution which provided nearly what I needed. Here is the solution that works best so far (slightly amended from @Waldi 's answer):-
CondCount <- function(data,maxdelay){
result <- list()
row <- 0
calls <- 0
OrigTime <- NA
n <- nrow(data)
for (i in 1:n) {
if (is.na(OrigTime)) {
OrigTime <- data$DateTime[[i]]
calls <- 0
}
calls = calls + 1
if (data$Status[[i]] == "Answered" | difftime(data$DateTime[[i]],OrigTime,units='hours') > maxdelay | i==n) {
row <- row + 1
result[[row]] <- data.frame(OrigTime = OrigTime, LastTime = data$DateTime[[i]], calls = calls, Status = factor(data$Status[[i]],levels=c("Answered" ,"Abandoned" ,"Engaged","Unknown")), Successful = ifelse(data$Status[[i]]=="Answered",'Y','N') )
OrigTime <- NA
}
}
dplyr::bind_rows(result)
}
df %>% arrange(UserID,DateTime) %>%
split(.$UserID) %>%
map(function(data) {CondCount(data,1) }) %>%
bind_rows(.id="UserID")
Please see the 2 steps that I had written prior to my edit. This time around, the period is 1 hour and not 5 hour.
With @Waldi 's solution, here is when it works on my real df (I apologise for the colour coding that I have used, if by chance there are any colour-blind SO users):-
Correct outcome
and with @Waldi 's solution, it will give you this:-
which is correct! This is what I am aiming for. However, I want to illustrate two examples of what happens when I run this code, which gives the undesired outcome:-
Incorrect Outcome 1
which gives you this:-
This is incorrect. It should be two rows with one for each attempt (each with a final status of "Abandoned") and not one row, as the time difference between the last two rows is greater than 60 mins.
Incorrect Outcome 2
which gives you this:-
This is incorrect. It should be two rows with one for each attempt (first row with a status of "Engaged", second row with a status of "Answered").
I have to give massive credit to @Waldi because the solution works perfectly for answered calls. However, it does not take into consideration the other status types i.e. Abandoned and Engaged. It may be a case of not enough conditions being met for these two statuses. As always, any help would be appreciated!
You could use purrr
to split data by user, and use a simple for-loop function to implement the logic you're looking for :
library(purrr)
CondCount <- function(data,maxdelay){
result <- list()
row <- 0
calls <- 0
OrigTime <- NA
n <- nrow(data)
for (i in 1:n) {
if (is.na(OrigTime)) {
OrigTime <- data$DateTime[[i]]
calls <- 0
}
calls = calls + 1
if (difftime(data$DateTime[[i]],OrigTime,units='hours') > maxdelay) {
row <- row + 1
result[[row]] <- data.frame(OrigTime = OrigTime, LastTime = data$DateTime[[i-1]], calls = calls, Status = factor(data$Status[[i-1]],levels=c("Answered" ,"Abandoned" ,"Engaged")), Successful = ifelse(data$Status[[i]]=="Answered",'Y','N') )
OrigTime <- data$DateTime[[i]]
}
if ((data$Status[[i]] !="Engaged") | i == n) {
row <- row + 1
result[[row]] <- data.frame(OrigTime = OrigTime, LastTime = data$DateTime[[i]], calls = calls, Status = factor(data$Status[[i]],levels=c("Answered" ,"Abandoned" ,"Engaged")), Successful = ifelse(data$Status[[i]]=="Answered",'Y','N') )
OrigTime <- NA
}
}
dplyr::bind_rows(result)
}
df %>% arrange(UserID,DateTime) %>%
split(.$UserID) %>%
map(function(data) {CondCount(data,5) }) %>%
bind_rows(.id="UserID")
UserID OrigTime LastTime calls Status Successful
1 022098d3cf 2020-08-01 03:15:00 2020-08-01 03:15:00 1 Answered Y
2 022098d3cf 2020-08-01 05:45:00 2020-08-01 05:45:00 1 Answered Y
3 022098d3cf 2020-08-01 08:15:00 2020-08-01 08:15:00 1 Abandoned N
4 022098d3cf 2020-08-01 10:45:00 2020-08-01 10:45:00 1 Answered Y
5 022098d3cf 2020-08-01 13:15:00 2020-08-01 13:15:00 1 Abandoned N
6 022098d3cf 2020-08-01 15:45:00 2020-08-01 15:45:00 1 Abandoned N
7 022098d3cf 2020-08-01 18:15:00 2020-08-01 18:15:00 1 Abandoned N
8 18f13c3972 2020-08-01 01:15:00 2020-08-01 03:45:00 2 Abandoned N
9 18f13c3972 2020-08-01 06:15:00 2020-08-01 06:15:00 1 Answered Y
10 18f13c3972 2020-08-01 08:45:00 2020-08-01 13:45:00 3 Answered Y
If the loop needs to be very fast, it can easily be converted to Rcpp.
NB : for some reason, set.seed(123) doesn't seem to be enough to create reproducible results.