rdplyrtimegroupingintervals

Group rows by group & overlapping time intervals & then keep highest priority row in R


I have a dataset as follows

data<- data.frame(group_ID= c("cred", "cred", "cred", "cyellow", "cyellow", "cgreen"), 
                  Start =c("2024-06-27,10:31:34", "2024-06-27,10:42:26", "2024-06-27,12:33:00", "2024-06-27,12:34:00","2024-06-27,12:54:00","2024-06-28,14:58:51"), 
End = c("2024-06-27,11:31:34","2024-06-27,11:42:26","2024-06-27,13:33:00", "2024-06-27,13:34:00","2024-06-27,13:54:00","2024-06-28,15:54:14"),
priorityrank=c(1, 2, 2, 2, 3, 3),
goaltokeep = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE))

Where each row is an observation of a group for a given time window. Sometimes multiple people observed the same group at the same time, so we give a "priorityrank" where lower numbers are the preferred observation, with 1 being the most preferred and 3 being the least preferred. Sometimes different groups are observed at the same time and this does not matter.

I want to keep the row with the smaller value "priorityrank" number for a given group when >1 observation time window overlaps. With the example data the "goal to keep" column shows the rows that should be retained (TRUE) in the final dataset.

I started out in trying to assess overlap for different rows by ID number, however, When I run the example data (mine or provided by the commenter) using the suggested code by Alex (hasn't logged in 3+ years so can't ask) I get the following error.

Error in eval_tidy(args[[j]], mask) : 
  no function to return from, jumping to top level

This SQL question gets close, but I don't know anything about SQL or how to do comparable in R.

Thanks in advance for your thoughts and help! I have been struggling for a while with this.


UPDATE:

Thank you all so much! I am having an unexpected error throw in about 60 of the 1000 cases with both possible solutions when I use the larger dataframe. Even when I sort my data frame by start time these errors persist.

I isolated one of these instances that throws the error, and here it is reproduced below as an example to work with. I am not sure what is wrong since the start does come before the end for any given row?

dataerr<- data.frame(group_ID= c("cred", "cred", "cred"), 
                  Start =c("2024-06-27,12:33:00","2024-06-27,13:26:57","2024-06-27,13:33:36"), 
End = c("2024-06-27,13:33:00","2024-06-27,13:48:49","2024-06-27,13:49:42"),
priorityrank=c(2, 3, 2),
goaltokeep = c(TRUE, FALSE, TRUE))

Here are the errors for each solution. I am not sure how to interpret them even after searching online for a bit as the starts seem to be before the ends? Maybe it is about something between lines?

Error Friede solution:

Error in `iv()`:
! `start` must be less than `end`.
ℹ `start` is not less than `end` at locations: `2`.

Error r2evans solution:

Error in `[[<-.data.frame`(`*tmp*`, col, value = c("cred", "cred", "cred",  : 
  replacement has 6 rows, data has 3

Solution

  • Since you are looking for a {dplyr}-solution, I suggest incorparating the powerful {ivs}.

    library(ivs)
    library(dplyr)
    # (1) clean data 
    data = 
      data |>
      mutate(Start = strptime(Start, '%F,%T'), # %Y-%m-%d, %H:%M:%S
             End = strptime(End, '%F,%T')) |>
      arrange(group_ID, Start, End)
    # (2) desired filtering 
    data |>
      mutate(iv = iv_identify_group(iv(Start, End)), .by=group_ID) |>
      filter(priorityrank == min(priorityrank), .by=c(iv, group_ID)) |>
      select(-iv) # or -c(iv, goaltokeep) optional
    
      group_ID               Start                 End priorityrank goaltokeep
    1   cgreen 2024-06-28 14:58:51 2024-06-28 15:54:14            3       TRUE
    2     cred 2024-06-27 10:31:34 2024-06-27 11:31:34            1       TRUE
    3     cred 2024-06-27 12:33:00 2024-06-27 13:33:00            2       TRUE
    4     cred 2024-06-27 13:33:36 2024-06-27 13:49:42            2       TRUE
    5  cyellow 2024-06-27 12:34:00 2024-06-27 13:34:00            2       TRUE