rdatetimedplyrtimestamprolling-computation

R calculate timestamp of earliest sequential prior message based on user ID of who sent message


I have a dataset of messages being sent between two people (user A and user B) on different chats over time. I need to figure out how to track the datetime of the last but earliest user A message datetime in relation to the next messages from user B, regardless of how many sequential back-to-back messages are sent by user B. For example, let's say I have the following data:

chatID  userID  message_date
1   A   2023-10-11 14:32:39
1   B   2023-10-11 14:34:14
1   A   2023-10-11 14:37:22
1   A   2023-10-11 14:38:48
1   B   2023-10-11 14:42:07
1   A   2023-10-11 14:43:58
1   B   2023-10-11 14:45:36
1   B   2023-10-11 14:46:11
1   A   2023-10-11 14:50:08
1   B   2023-10-11 14:52:17
2   A   2023-10-17 09:10:28
2   A   2023-10-17 09:11:54
2   A   2023-10-17 09:12:36
2   B   2023-10-17 09:18:47
2   B   2023-10-17 09:19:22
2   A   2023-10-17 09:22:03
2   B   2023-10-17 09:24:50
2   B   2023-10-17 09:28:16
2   A   2023-10-17 09:32:07
2   A   2023-10-17 09:33:59
2   A   2023-10-17 09:34:09
2   B   2023-10-17 09:40:21
2   B   2023-10-17 09:41:47
2   B   2023-10-17 09:42:03
2   A   2023-10-17 09:44:48
2   B   2023-10-17 09:45:57
2   B   2023-10-17 09:46:13
2   A   2023-10-17 09:52:43
2   B   2023-10-17 09:54:01

What I am hoping to calculate is this:

chatID  userID     message_date message_date_earliest_priorA
1   A   2023-10-11 14:32:39     <NA>
1   B   2023-10-11 14:34:14     2023-10-11 14:32:39
1   A   2023-10-11 14:37:22     <NA>
1   A   2023-10-11 14:38:48     <NA>
1   B   2023-10-11 14:42:07     2023-10-11 14:37:22
1   A   2023-10-11 14:43:58     <NA>
1   B   2023-10-11 14:45:36     2023-10-11 14:43:58
1   B   2023-10-11 14:46:11     2023-10-11 14:43:58
1   A   2023-10-11 14:50:08     <NA>
1   B   2023-10-11 14:52:17     2023-10-11 14:50:08
2   A   2023-10-17 09:10:28     <NA>
2   A   2023-10-17 09:11:54     <NA>
2   A   2023-10-17 09:12:36     <NA>
2   B   2023-10-17 09:18:47     2023-10-17 09:10:28
2   B   2023-10-17 09:19:22     2023-10-17 09:10:28
2   A   2023-10-17 09:22:03     <NA>
2   B   2023-10-17 09:24:50     2023-10-17 09:22:03
2   B   2023-10-17 09:28:16     2023-10-17 09:22:03
2   A   2023-10-17 09:32:07     <NA>
2   A   2023-10-17 09:33:59     <NA>
2   A   2023-10-17 09:34:09     <NA>
2   B   2023-10-17 09:40:21     2023-10-17 09:32:07
2   B   2023-10-17 09:41:47     2023-10-17 09:32:07
2   A   2023-10-17 09:44:48     <NA>
2   B   2023-10-17 09:45:57     2023-10-17 09:44:48
2   B   2023-10-17 09:46:13     2023-10-17 09:44:48
2   A   2023-10-17 09:52:43     <NA>
2   B   2023-10-17 09:54:01     2023-10-17 09:52:43

Where if user A sends a message at 2023-10-17 09:32:07, followed by two more messages before userB responds, then the userB responses get assigned the 2023-10-17 09:32:07 value

I was able to find a solution with the following code (a mix of solutions from two responses from this post: R calculate timestamp of last message based on user ID of who sent message)

# structure of the data
data = structure(list(chatID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), userID = c("A", 
"B", "A", "A", "B", "A", "B", "B", "A", "B", "A", "A", "A", "B", 
"B", "A", "B", "B", "A", "A", "A", "B", "B", "B", "A", "B", "B", 
"A", "B"), message_date = structure(c(1697034759, 1697034854, 
1697035042, 1697035128, 1697035327, 1697035438, 1697035536, 1697035571, 
1697035808, 1697035937, 1697533828, 1697533914, 1697533956, 1697534327, 
1697534362, 1697534523, 1697534690, 1697534896, 1697535127, 1697535239, 
1697535249, 1697535621, 1697535707, 1697535723, 1697535888, 1697535957, 
1697535973, 1697536363, 1697536441), class = c("POSIXct", "POSIXt"
), tzone = "UTC")), row.names = c(NA, -29L), class = c("tbl_df", 
"tbl", "data.frame"))

# add column notating userID with numeric values as a difference from -1
# so first A message gets -1, subsequent A messages get 0, first B message gets 1, subsequent B messages get 0
data$d = c(-1, diff(as.integer(as.factor(data$userID))))

# subset to first A messages (where d== -1)
# make the userID B instead of A so we can merge back
# rename message_date to keep track
# keep only the columns we need
data_firstA = data %>% 
  filter(d==(-1)) %>% 
  mutate(userID="B") %>%
  rename(message_date_earliest_priorA=message_date) %>%
  dplyr::select(chatID, userID, message_date_earliest_priorA)

# left join the subset data by chatID and userID
# and using closest() to pair the closest dates where message_date is >= the previous, earliest and most recent A message_date  
data_final = data %>% 
  left_join(data_firstA,
            by=join_by(chatID, userID, closest(message_date>=message_date_earliest_priorA)))

data_final

But I feel like there's got to be a cleaner way of doing this. My actual dataset is around 3 million rows with a lot of different chatIDs. Anyone have ideas on how to make this process more efficient?


Solution

  • 1) dplyr We can code this in a single mutate. First use consecutive_id to create column g which labels all rows in the first run as 1, all rows in the second run as 2 and so on. Then on each row use match to find the first row with run g-1. Next NA out those rows whose userID is that of the first row and finally remove g. One variation is to replace match with the drop-in replacement fmatch from the fastmatch package.

    library(dplyr)
    
    data %>%
      mutate(g = consecutive_id(userID),
       earliest_prev = replace(message_date[match(g-1,g)], userID==first(userID), NA),
       g = NULL)
    

    giving

    # A tibble: 29 × 4
       chatID userID message_date        earliest_prev      
        <dbl> <chr>  <dttm>              <dttm>             
     1      1 A      2023-10-11 14:32:39 NA                 
     2      1 B      2023-10-11 14:34:14 2023-10-11 14:32:39
     3      1 A      2023-10-11 14:37:22 NA                 
     4      1 A      2023-10-11 14:38:48 NA                 
     5      1 B      2023-10-11 14:42:07 2023-10-11 14:37:22
     6      1 A      2023-10-11 14:43:58 NA                 
     7      1 B      2023-10-11 14:45:36 2023-10-11 14:43:58
     8      1 B      2023-10-11 14:46:11 2023-10-11 14:43:58
     9      1 A      2023-10-11 14:50:08 NA                 
    10      1 B      2023-10-11 14:52:17 2023-10-11 14:50:08
    # ℹ 19 more rows
    # ℹ Use `print(n = ...)` to see more rows
    

    2) Base R This can be readily translated to Base R. Define rleid to be like consecutive_id above and then use the same processing with within instead of mutate.

    rleid <- function(x) with(rle(x), rep(seq_along(values), lengths))
    
    data |>
      within({
        g = rleid(userID)
        earliest_prev = replace(message_date[match(g-1,g)], userID==userID[1], NA)
        g = NULL
      })