Say there is a dataset with duplicated PersonIDs, and multiple dates.
PersonID | Date |
---|---|
1 | 2024-01-01 |
1 | 2024-01-02 |
1 | 2024-01-09 |
1 | 2024-01-15 |
2 | 2024-08-05 |
2 | 2024-08-06 |
3 | 2024-01-07 |
3 | 2024-01-08 |
3 | 2024-01-15 |
I want to keep only the first record for each PersonID, and filter out any records that is within a 3 day gap. Then, for the next record that is kept, remove any records that is within 3 days after that record as well. And reiterating that as many times as needed.
So in this example, only these will be left:
PersonID | Date |
---|---|
1 | 2024-01-01 |
1 | 2024-01-09 |
1 | 2024-01-15 |
2 | 2024-08-05 |
3 | 2024-01-07 |
3 | 2024-01-15 |
Is there an efficient way of doing that? Or do I have to create a column one step at a time?
This type of windowing operation could be done with the help of Reduce
. For example
day_lag_fitler <- function(x, cutoff = 3) {
Reduce(function(acc, current) {
if (difftime(current, acc$ref, unit="days") < cutoff) {
list(keep=FALSE, ref=acc$ref)
} else {
list(keep=TRUE, ref=current)
}
}, x[-1], init=list(keep=TRUE, ref=x[1]), accumulate = TRUE) |> sapply('[[', "keep")
}
We basically just keep a list of reference day and whether or not to keep the value and at the end we just peel off the keep part.
And then using dplyr
you can apply it to each of the person groups
dd |>
filter(day_lag_fitler(Date), .by=PersonID)
# PersonID Date
# 1 1 2024-01-01
# 2 1 2024-01-09
# 3 1 2024-01-15
# 4 2 2024-01-16
# 5 2 2024-09-01
# 6 3 2024-01-07
# 7 3 2024-01-15
Tested with
dd<- structure(list(PersonID = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L
), Date = structure(list(sec = c(0, 0, 0, 0, 0, 0, 0, 0, 0),
min = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), hour = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), mday = c(1L, 2L, 9L, 15L,
16L, 1L, 7L, 8L, 15L), mon = c(0L, 0L, 0L, 0L, 0L, 8L, 0L,
0L, 0L), year = c(124L, 124L, 124L, 124L, 124L, 124L, 124L,
124L, 124L), wday = c(1L, 2L, 2L, 1L, 2L, 0L, 0L, 1L, 1L),
yday = c(0L, 1L, 8L, 14L, 15L, 244L, 6L, 7L, 14L), isdst = c(0L,
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L), zone = c("EST", "EST", "EST",
"EST", "EST", "EDT", "EST", "EST", "EST"), gmtoff = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_)), class = c("POSIXlt",
"POSIXt"), tzone = c("", "EST", "EDT"), balanced = TRUE)), row.names = c(NA,
-9L), class = "data.frame")