[SOLVED] Is there an efficient way (or a package) to dynamically filter a dataset by time between dates?

Is there an efficient way (or a package) to dynamically filter a dataset by time between dates?

Say there is a dataset with duplicated PersonIDs, and multiple dates.

PersonID	Date
1	2024-01-01
1	2024-01-02
1	2024-01-09
1	2024-01-15
2	2024-08-05
2	2024-08-06
3	2024-01-07
3	2024-01-08
3	2024-01-15

I want to keep only the first record for each PersonID, and filter out any records that is within a 3 day gap. Then, for the next record that is kept, remove any records that is within 3 days after that record as well. And reiterating that as many times as needed.

So in this example, only these will be left:

PersonID	Date
1	2024-01-01
1	2024-01-09
1	2024-01-15
2	2024-08-05
3	2024-01-07
3	2024-01-15

Is there an efficient way of doing that? Or do I have to create a column one step at a time?

Solution

This type of windowing operation could be done with the help of Reduce. For example

day_lag_fitler <- function(x, cutoff = 3) {
  Reduce(function(acc, current) {
    if (difftime(current, acc$ref, unit="days") < cutoff) {
      list(keep=FALSE, ref=acc$ref)
    } else {
      list(keep=TRUE, ref=current)
    }
  }, x[-1], init=list(keep=TRUE, ref=x[1]), accumulate = TRUE) |> sapply('[[', "keep")
}

We basically just keep a list of reference day and whether or not to keep the value and at the end we just peel off the keep part.

And then using dplyr you can apply it to each of the person groups

dd |>
  filter(day_lag_fitler(Date), .by=PersonID)
#   PersonID       Date
# 1        1 2024-01-01
# 2        1 2024-01-09
# 3        1 2024-01-15
# 4        2 2024-01-16
# 5        2 2024-09-01
# 6        3 2024-01-07
# 7        3 2024-01-15

Tested with

dd<- structure(list(PersonID = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L
), Date = structure(list(sec = c(0, 0, 0, 0, 0, 0, 0, 0, 0), 
    min = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), hour = c(0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), mday = c(1L, 2L, 9L, 15L, 
    16L, 1L, 7L, 8L, 15L), mon = c(0L, 0L, 0L, 0L, 0L, 8L, 0L, 
    0L, 0L), year = c(124L, 124L, 124L, 124L, 124L, 124L, 124L, 
    124L, 124L), wday = c(1L, 2L, 2L, 1L, 2L, 0L, 0L, 1L, 1L), 
    yday = c(0L, 1L, 8L, 14L, 15L, 244L, 6L, 7L, 14L), isdst = c(0L, 
    0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L), zone = c("EST", "EST", "EST", 
    "EST", "EST", "EDT", "EST", "EST", "EST"), gmtoff = c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_)), class = c("POSIXlt", 
"POSIXt"), tzone = c("", "EST", "EDT"), balanced = TRUE)), row.names = c(NA, 
-9L), class = "data.frame")