I have a dataset as follows
data<- data.frame(group_ID= c("cred", "cred", "cred", "cyellow", "cyellow", "cgreen"),
Start =c("2024-06-27,10:31:34", "2024-06-27,10:42:26", "2024-06-27,12:33:00", "2024-06-27,12:34:00","2024-06-27,12:54:00","2024-06-28,14:58:51"),
End = c("2024-06-27,11:31:34","2024-06-27,11:42:26","2024-06-27,13:33:00", "2024-06-27,13:34:00","2024-06-27,13:54:00","2024-06-28,15:54:14"),
priorityrank=c(1, 2, 2, 2, 3, 3),
goaltokeep = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE))
Where each row is an observation of a group for a given time window. Sometimes multiple people observed the same group at the same time, so we give a "priorityrank" where lower numbers are the preferred observation, with 1 being the most preferred and 3 being the least preferred. Sometimes different groups are observed at the same time and this does not matter.
I want to keep the row with the smaller value "priorityrank" number for a given group when >1 observation time window overlaps. With the example data the "goal to keep" column shows the rows that should be retained (TRUE) in the final dataset.
I started out in trying to assess overlap for different rows by ID number, however, When I run the example data (mine or provided by the commenter) using the suggested code by Alex (hasn't logged in 3+ years so can't ask) I get the following error.
Error in eval_tidy(args[[j]], mask) :
no function to return from, jumping to top level
This SQL question gets close, but I don't know anything about SQL or how to do comparable in R.
Thanks in advance for your thoughts and help! I have been struggling for a while with this.
UPDATE:
Thank you all so much! I am having an unexpected error throw in about 60 of the 1000 cases with both possible solutions when I use the larger dataframe. Even when I sort my data frame by start time these errors persist.
I isolated one of these instances that throws the error, and here it is reproduced below as an example to work with. I am not sure what is wrong since the start does come before the end for any given row?
dataerr<- data.frame(group_ID= c("cred", "cred", "cred"),
Start =c("2024-06-27,12:33:00","2024-06-27,13:26:57","2024-06-27,13:33:36"),
End = c("2024-06-27,13:33:00","2024-06-27,13:48:49","2024-06-27,13:49:42"),
priorityrank=c(2, 3, 2),
goaltokeep = c(TRUE, FALSE, TRUE))
Here are the errors for each solution. I am not sure how to interpret them even after searching online for a bit as the starts seem to be before the ends? Maybe it is about something between lines?
Error Friede solution:
Error in `iv()`:
! `start` must be less than `end`.
ℹ `start` is not less than `end` at locations: `2`.
Error r2evans solution:
Error in `[[<-.data.frame`(`*tmp*`, col, value = c("cred", "cred", "cred", :
replacement has 6 rows, data has 3
Since you are looking for a {dplyr}
-solution, I suggest incorparating the powerful {ivs}
.
library(ivs)
library(dplyr)
# (1) clean data
data =
data |>
mutate(Start = strptime(Start, '%F,%T'), # %Y-%m-%d, %H:%M:%S
End = strptime(End, '%F,%T')) |>
arrange(group_ID, Start, End)
# (2) desired filtering
data |>
mutate(iv = iv_identify_group(iv(Start, End)), .by=group_ID) |>
filter(priorityrank == min(priorityrank), .by=c(iv, group_ID)) |>
select(-iv) # or -c(iv, goaltokeep) optional
group_ID Start End priorityrank goaltokeep
1 cgreen 2024-06-28 14:58:51 2024-06-28 15:54:14 3 TRUE
2 cred 2024-06-27 10:31:34 2024-06-27 11:31:34 1 TRUE
3 cred 2024-06-27 12:33:00 2024-06-27 13:33:00 2 TRUE
4 cred 2024-06-27 13:33:36 2024-06-27 13:49:42 2 TRUE
5 cyellow 2024-06-27 12:34:00 2024-06-27 13:34:00 2 TRUE