I have a dataset of messages being sent between two people (user A and user B) on different chats over time. I need to figure out how to track the datetime of the last but earliest user A message datetime in relation to the next messages from user B, regardless of how many sequential back-to-back messages are sent by user B. For example, let's say I have the following data:
chatID userID message_date
1 A 2023-10-11 14:32:39
1 B 2023-10-11 14:34:14
1 A 2023-10-11 14:37:22
1 A 2023-10-11 14:38:48
1 B 2023-10-11 14:42:07
1 A 2023-10-11 14:43:58
1 B 2023-10-11 14:45:36
1 B 2023-10-11 14:46:11
1 A 2023-10-11 14:50:08
1 B 2023-10-11 14:52:17
2 A 2023-10-17 09:10:28
2 A 2023-10-17 09:11:54
2 A 2023-10-17 09:12:36
2 B 2023-10-17 09:18:47
2 B 2023-10-17 09:19:22
2 A 2023-10-17 09:22:03
2 B 2023-10-17 09:24:50
2 B 2023-10-17 09:28:16
2 A 2023-10-17 09:32:07
2 A 2023-10-17 09:33:59
2 A 2023-10-17 09:34:09
2 B 2023-10-17 09:40:21
2 B 2023-10-17 09:41:47
2 B 2023-10-17 09:42:03
2 A 2023-10-17 09:44:48
2 B 2023-10-17 09:45:57
2 B 2023-10-17 09:46:13
2 A 2023-10-17 09:52:43
2 B 2023-10-17 09:54:01
What I am hoping to calculate is this:
chatID userID message_date message_date_earliest_priorA
1 A 2023-10-11 14:32:39 <NA>
1 B 2023-10-11 14:34:14 2023-10-11 14:32:39
1 A 2023-10-11 14:37:22 <NA>
1 A 2023-10-11 14:38:48 <NA>
1 B 2023-10-11 14:42:07 2023-10-11 14:37:22
1 A 2023-10-11 14:43:58 <NA>
1 B 2023-10-11 14:45:36 2023-10-11 14:43:58
1 B 2023-10-11 14:46:11 2023-10-11 14:43:58
1 A 2023-10-11 14:50:08 <NA>
1 B 2023-10-11 14:52:17 2023-10-11 14:50:08
2 A 2023-10-17 09:10:28 <NA>
2 A 2023-10-17 09:11:54 <NA>
2 A 2023-10-17 09:12:36 <NA>
2 B 2023-10-17 09:18:47 2023-10-17 09:10:28
2 B 2023-10-17 09:19:22 2023-10-17 09:10:28
2 A 2023-10-17 09:22:03 <NA>
2 B 2023-10-17 09:24:50 2023-10-17 09:22:03
2 B 2023-10-17 09:28:16 2023-10-17 09:22:03
2 A 2023-10-17 09:32:07 <NA>
2 A 2023-10-17 09:33:59 <NA>
2 A 2023-10-17 09:34:09 <NA>
2 B 2023-10-17 09:40:21 2023-10-17 09:32:07
2 B 2023-10-17 09:41:47 2023-10-17 09:32:07
2 A 2023-10-17 09:44:48 <NA>
2 B 2023-10-17 09:45:57 2023-10-17 09:44:48
2 B 2023-10-17 09:46:13 2023-10-17 09:44:48
2 A 2023-10-17 09:52:43 <NA>
2 B 2023-10-17 09:54:01 2023-10-17 09:52:43
Where if user A sends a message at 2023-10-17 09:32:07, followed by two more messages before userB responds, then the userB responses get assigned the 2023-10-17 09:32:07 value
I was able to find a solution with the following code (a mix of solutions from two responses from this post: R calculate timestamp of last message based on user ID of who sent message)
# structure of the data
data = structure(list(chatID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), userID = c("A",
"B", "A", "A", "B", "A", "B", "B", "A", "B", "A", "A", "A", "B",
"B", "A", "B", "B", "A", "A", "A", "B", "B", "B", "A", "B", "B",
"A", "B"), message_date = structure(c(1697034759, 1697034854,
1697035042, 1697035128, 1697035327, 1697035438, 1697035536, 1697035571,
1697035808, 1697035937, 1697533828, 1697533914, 1697533956, 1697534327,
1697534362, 1697534523, 1697534690, 1697534896, 1697535127, 1697535239,
1697535249, 1697535621, 1697535707, 1697535723, 1697535888, 1697535957,
1697535973, 1697536363, 1697536441), class = c("POSIXct", "POSIXt"
), tzone = "UTC")), row.names = c(NA, -29L), class = c("tbl_df",
"tbl", "data.frame"))
# add column notating userID with numeric values as a difference from -1
# so first A message gets -1, subsequent A messages get 0, first B message gets 1, subsequent B messages get 0
data$d = c(-1, diff(as.integer(as.factor(data$userID))))
# subset to first A messages (where d== -1)
# make the userID B instead of A so we can merge back
# rename message_date to keep track
# keep only the columns we need
data_firstA = data %>%
filter(d==(-1)) %>%
mutate(userID="B") %>%
rename(message_date_earliest_priorA=message_date) %>%
dplyr::select(chatID, userID, message_date_earliest_priorA)
# left join the subset data by chatID and userID
# and using closest() to pair the closest dates where message_date is >= the previous, earliest and most recent A message_date
data_final = data %>%
left_join(data_firstA,
by=join_by(chatID, userID, closest(message_date>=message_date_earliest_priorA)))
data_final
But I feel like there's got to be a cleaner way of doing this. My actual dataset is around 3 million rows with a lot of different chatIDs. Anyone have ideas on how to make this process more efficient?
1) dplyr We can code this in a single mutate
. First use consecutive_id
to create column g
which labels all rows in the first run as 1, all rows in the second run as 2 and so on. Then on each row use match
to find the first row with run g-1. Next NA out those rows whose userID
is that of the first row and finally remove g
. One variation is to replace match
with the drop-in replacement fmatch
from the fastmatch package.
library(dplyr)
data %>%
mutate(g = consecutive_id(userID),
earliest_prev = replace(message_date[match(g-1,g)], userID==first(userID), NA),
g = NULL)
giving
# A tibble: 29 × 4
chatID userID message_date earliest_prev
<dbl> <chr> <dttm> <dttm>
1 1 A 2023-10-11 14:32:39 NA
2 1 B 2023-10-11 14:34:14 2023-10-11 14:32:39
3 1 A 2023-10-11 14:37:22 NA
4 1 A 2023-10-11 14:38:48 NA
5 1 B 2023-10-11 14:42:07 2023-10-11 14:37:22
6 1 A 2023-10-11 14:43:58 NA
7 1 B 2023-10-11 14:45:36 2023-10-11 14:43:58
8 1 B 2023-10-11 14:46:11 2023-10-11 14:43:58
9 1 A 2023-10-11 14:50:08 NA
10 1 B 2023-10-11 14:52:17 2023-10-11 14:50:08
# ℹ 19 more rows
# ℹ Use `print(n = ...)` to see more rows
2) Base R This can be readily translated to Base R. Define rleid
to be like consecutive_id
above and then use the same processing with within
instead of mutate
.
rleid <- function(x) with(rle(x), rep(seq_along(values), lengths))
data |>
within({
g = rleid(userID)
earliest_prev = replace(message_date[match(g-1,g)], userID==userID[1], NA)
g = NULL
})