Tags: r, date, time-series

Obtaining regularly spaced observations from an irregular time series


I have a dataset with multiple observations per subject over time. The dates for the first and last observation per individual are different across the dataset, but for each individual there is only one observation per day, as in the example below:


data <- data.frame(
  dates = lubridate::as_date(c(
    "2016-04-25", "2016-04-26", "2016-04-28", "2016-04-29",
    "2016-05-02", "2016-05-03", "2016-05-04", "2016-05-05",
    "2016-05-08", "2016-05-09", "2016-05-10", "2016-05-11",
    "2016-05-13", "2016-05-16", "2016-06-10",
    "2020-11-25", "2020-09-20", "2020-09-17", "2020-09-14",
    "2020-09-11", "2020-09-10", "2020-09-03", "2020-08-27",
    "2020-08-26", "2020-08-04", "2020-08-01", "2020-07-28",
    "2020-05-05", "2020-03-25", "2013-10-16", "2013-08-20"
  )),
  vals = runif(31, min = 50, max = 300),
  subject = c(rep(1, 15), rep(2, 16))
)

What I want to do is obtain the sequences of observations in which consecutive observations are separated by a certain amount of time, say 3 days, and then keep the longest one. The challenge is that, because the observations are irregularly spaced, this requires examining adjacent and neighboring observations to check whether the 3-day gap is present. Also, a sequence can start anywhere within an individual's observations, one individual can have more than one sequence, and the sequences should be examined starting from the oldest date. For the example above, the longest sequence of observations with a 3-day gap for subject 1 would have the dates 2016-04-26, 2016-04-29, 2016-05-02, 2016-05-05, 2016-05-08, and 2016-05-11. There is a shorter sequence for subject 1 (2016-05-13, 2016-05-16), but only the longest one would be kept in the end. For subject 2, the sampled observations would have the dates 2020-09-20, 2020-09-17, 2020-09-14, and 2020-09-11.
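
To make the irregular spacing concrete, the day gaps between consecutive sorted dates per subject can be inspected with base R:

# day gaps between consecutive sorted dates, per subject
with(data, tapply(dates, subject, function(d) diff(sort(d))))
# subject 1 gaps (days): 1 2 1 3 1 1 1 3 1 1 1 2 3 25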

I have tried the ideas shown here, but I think what I am trying to do is more complex, as it will involve some type of sliding window that moves along the column. I would be thankful if someone has ideas on how to approach this.


Solution

  • Assuming "exactly 3" for your desired gaps, try this:

    fun <- function(x, gap = 3) {
      xsorted <- sort(x)
      # from each starting point, grow a chain by appending each later value
      # only when it is exactly `gap` after the current end of the chain
      subs <- lapply(seq_along(xsorted)[-length(xsorted)], function(i) {
        Reduce(function(prev, this) if ((this - tail(prev, 1)) == gap) c(prev, this) else prev,
               xsorted[i:length(xsorted)])
      })
      # keep the elements that belong to the longest such chain
      x %in% subs[[which.max(lengths(subs))]]
    }
    
    subset(data, ave(as.integer(dates), subject, FUN = fun) > 0)
    #         dates     vals subject
    # 2  2016-04-26 128.4128       1
    # 4  2016-04-29 280.2111       1
    # 5  2016-05-02 160.2425       1
    # 8  2016-05-05 147.7424       1
    # 9  2016-05-08 295.8312       1
    # 12 2016-05-11 250.1512       1
    # 17 2020-09-20 130.1142       2
    # 18 2020-09-17 276.8538       2
    # 19 2020-09-14 275.2845       2
    # 20 2020-09-11 127.5926       2
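
    To see what the Reduce() step does in isolation, here is a toy example (the input vector is made up):

    x <- c(0, 3, 5, 6, 9, 20)
    # starting from the first element, append each later value only if it is
    # exactly 3 after the current end of the chain, otherwise keep the chain
    Reduce(function(prev, this) if (this - tail(prev, 1) == 3) c(prev, this) else prev, x)
    # [1] 0 3 6 9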
    

    The function returns a logical vector the same length as its input, which makes it easy to use with subset() (or dplyr::filter()) and therefore to retain all of the other columns. stats::ave() has two inconvenient "features": (a) it coerces the output to the same class as its first input; and (b) it doesn't know how to turn a logical into a Date, a double problem. To circumvent this, I convert the dates to integer first, so no Date conversion is attempted, and when fun() returns a logical, ave() converts it to 0s (FALSE) and 1s (TRUE), hence the > 0 test in subset().
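
    For instance, with an integer first argument, a logical returned by FUN comes back as 0/1 (a minimal illustration of the coercion):

    # ave() writes FUN's logical result back into the integer input,
    # so TRUE/FALSE become 1/0
    ave(c(10L, 20L, 30L, 40L), c(1, 1, 2, 2), FUN = function(x) x == max(x))
    # [1] 0 1 0 1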

    If you're using dplyr, the grouping is a little more natural.

    library(dplyr)
    filter(data, fun(dates), .by = subject)
    #         dates     vals subject
    # 1  2016-04-26 128.4128       1
    # 2  2016-04-29 280.2111       1
    # 3  2016-05-02 160.2425       1
    # 4  2016-05-05 147.7424       1
    # 5  2016-05-08 295.8312       1
    # 6  2016-05-11 250.1512       1
    # 7  2020-09-20 130.1142       2
    # 8  2020-09-17 276.8538       2
    # 9  2020-09-14 275.2845       2
    # 10 2020-09-11 127.5926       2
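
    Note that the .by argument requires dplyr >= 1.1.0; with older versions, the equivalent uses an explicit group_by():

    data %>%
      group_by(subject) %>%
      filter(fun(dates)) %>%
      ungroup()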
    

    I can't promise that this will scale well: fun() runs a Reduce() pass over the remaining values from every starting position, so it is roughly O(m*n^2) in the worst case (where m is the number of subjects and n is the number of observations per subject), plus an O(n*log(n)) sort per subject.
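
    If scaling matters, a crude empirical check is easy to run (the sizes and date ranges below are arbitrary):

    # time fun() on random date subsets of increasing length; elapsed time
    # roughly quadrupling as n doubles would indicate O(n^2) behavior
    set.seed(42)
    sapply(c(500, 1000, 2000), function(n) {
      x <- sort(sample(seq(as.Date("2000-01-01"), by = "day", length.out = 10 * n), n))
      system.time(fun(as.integer(x)))[["elapsed"]]
    })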