R: difftime with lag to create another column

I have a dataset where each row is a cluster, and I want to know which clusters fall within the previous cluster, or 12h after the previous cluster ended.

My dataset looks like this:

cluster_id  period_id   clus_start          clus_end
36          SAT119_1    2011-12-08 14:19    2011-12-10 14:18
37          SAT119_1    2011-12-08 18:18    2011-12-10 06:20
40          SAT119_1    2011-12-14 22:19    2011-12-16 22:18
41          SAT119_1    2011-12-17 22:18    2011-12-18 18:18
42          SAT119_1    2011-12-19 05:19    2011-12-30 14:18
47          SAT119_1    2011-12-30 18:19    2012-01-04 22:19
49          SAT119_1    2012-01-06 06:18    2012-01-07 22:19
59          SAT119_1    2012-01-20 02:19    2012-01-21 22:18
61          SAT119_1    2012-01-22 18:18    2012-01-24 14:18
63          SAT119_1    2012-01-26 06:18    2012-01-28 14:20

What I want is to check, sequentially, if the start of cluster n+1 falls within the end (or 12h later) of cluster n, and I want to create a new column with a yes/no option. The first cluster of each period id would always have "no", and then it would be either "yes" or "no". But, at the same time, IF cluster n+1 falls within cluster n, I would like cluster n+2 to be checked in relation to cluster n.

In order words, if overlap (see table below) is yes, then that cluster should be ignored to calculate the time difference for the next cluster.

So, the output in this table would be:

cluster_id  period_id   clus_start          clus_end          overlap time_diff (days)
36          SAT119_1    2011-12-08 14:19    2011-12-10 14:18  no      NA
37          SAT119_1    2011-12-08 18:18    2011-12-10 06:20  yes     start_37 - end_36
40          SAT119_1    2011-12-14 22:19    2011-12-16 22:18  no      start_40 - end_36
41          SAT119_1    2011-12-17 22:18    2011-12-18 18:18  no      start_41 - end_40
42          SAT119_1    2011-12-19 05:19    2011-12-30 14:18  yes     start_42 - end_41
47          SAT119_1    2011-12-30 18:19    2012-01-04 22:19  no      start_47 - end_41
49          SAT119_1    2012-01-06 06:18    2012-01-07 22:19  no      start_49 - end_47
59          SAT119_1    2012-01-20 02:19    2012-01-21 22:18  no      start_59 - end_49
61          SAT119_1    2012-01-22 18:18    2012-01-24 14:18  no      start_61 - end_59
63          SAT119_1    2012-01-26 06:18    2012-01-28 14:20  no      start_63 - end_61

(Where start_37 and so on indicates the "clus_start" of cluster id 37)

I believe that, at first, the time has to be calculated for the first two rows, then the value in the column "within" is filled, and then the next calculations will be made considering the value (yes/no) on that column. But I am a bit stuck, so far I could only apply the difftime() function to get the times between the clusters:

 df$time_diff <- as.numeric(with(df, difftime(clus_start, ave(clus_end, period_id,FUN=lag), units='days')))

But this does not account for the yes/no column... Any idea how to incorporate this step?

Here is an example of the dataset:

structure(list(clus_start = structure(c(1323353941, 1323368281, 
1323901141, 1324160281, 1324275541, 1325269141, 1325830681, 1327025941, 
1327256281, 1327558681), class = c("POSIXct", "POSIXt"), tzone = "GMT"), 
    clus_end = structure(c(1323526681, 1323498001, 1324073881, 
    1324232281, 1325254681, 1325715541, 1325974741, 1327184281, 
    1327414681, 1327760401), class = c("POSIXct", "POSIXt"), tzone = "GMT"), 
    cluster_id = c("36", "37", "40", 
    "41", "42", "47", "49", "59", 
    "61", "63"), period_id = c("SAT119_1", "SAT119_1", 
    "SAT119_1", "SAT119_1", "SAT119_1", "SAT119_1", "SAT119_1", 
    "SAT119_1", "SAT119_1", "SAT119_1")), row.names = c(2L, 3L, 
4L, 5L, 6L, 8L, 9L, 13L, 14L, 15L), class = "data.frame")

Solution

You can produce an output matching your desired format, where:

First cluster in each period has no time difference (NA)
Overlapping clusters are marked with "yes"
Time differences are calculated relative to the appropriate reference cluster

df <- structure(list(clus_start = structure(c(1323353941, 1323368281, 
                                        1323901141, 1324160281, 1324275541, 1325269141, 1325830681, 1327025941, 
                                        1327256281, 1327558681), class = c("POSIXct", "POSIXt"), tzone = "GMT"), 
               clus_end = structure(c(1323526681, 1323498001, 1324073881, 
                                      1324232281, 1325254681, 1325715541, 1325974741, 1327184281, 
                                      1327414681, 1327760401), class = c("POSIXct", "POSIXt"), tzone = "GMT"), 
               cluster_id = c("36", "37", "40", 
                              "41", "42", "47", "49", "59", 
                              "61", "63"), period_id = c("SAT119_1", "SAT119_1", 
                                                         "SAT119_1", "SAT119_1", "SAT119_1", "SAT119_1", "SAT119_1", 
                                                         "SAT119_1", "SAT119_1", "SAT119_1")), row.names = c(2L, 3L, 
                                                                                                             4L, 5L, 6L, 8L, 9L, 13L, 14L, 15L), class = "data.frame")#




analyze_clusters <- function(df) {
  # Convert time columns to POSIXct if they aren't already
  df$clus_start <- as.POSIXct(df$clus_start, tz = "GMT")
  df$clus_end <- as.POSIXct(df$clus_end, tz = "GMT")
  
  # Initialize result columns
  df$overlap <- "no"
  df$time_diff <- NA
  
  # Process each period_id separately
  result <- df %>%
    group_by(period_id) %>%
    group_modify(~{
      dat <- .x
      n <- nrow(dat)
      
      # Reference cluster index (the one we're comparing against)
      ref_idx <- 1
      
      for(i in 2:n) {
        # Add 12 hours to the reference end time
        ref_end_plus_12h <- dat$clus_end[ref_idx] + 12 * 3600
        print(paste0("Ref: ",ref_end_plus_12h, " for i: ", i))
        # Check if current cluster starts within reference cluster (including 12h buffer)
        if(dat$clus_start[i] <= ref_end_plus_12h) {
          print(paste0(dat$clus_start[i], " <= ", ref_end_plus_12h, "!"))
          print("Overlap-> yes")
          dat$overlap[i] <- "yes"
          # Calculate time difference from reference cluster
          dat$time_diff[i] <- as.numeric(
            difftime(dat$clus_start[i], dat$clus_end[ref_idx], units = "days")
          )
        } else {
          print("Overlap-> no")
          # If no overlap, calculate time difference and update reference
          dat$time_diff[i] <- as.numeric(
            difftime(dat$clus_start[i], dat$clus_end[ref_idx], units = "days")
          )
          ref_idx <- i
          print(paste0("New ref index: ", i))
        }
      }
      dat
    }) %>%
    ungroup()
  
  return(result)
}
res <- analyze_clusters(df)

But then my result looks different from yours:

   period_id clus_start          clus_end            cluster_id overlap time_diff
   <chr>     <dttm>              <dttm>              <chr>      <chr>       <dbl>
 1 SAT119_1  2011-12-08 14:19:01 2011-12-10 14:18:01 36         no         NA    
 2 SAT119_1  2011-12-08 18:18:01 2011-12-10 06:20:01 37         yes        -1.83 
 3 SAT119_1  2011-12-14 22:19:01 2011-12-16 22:18:01 40         no          4.33 
 4 SAT119_1  2011-12-17 22:18:01 2011-12-18 18:18:01 41         no          1    
 5 SAT119_1  2011-12-19 06:19:01 2011-12-30 14:18:01 42         no          0.501
 6 SAT119_1  2011-12-30 18:19:01 2012-01-04 22:19:01 47         yes         0.167
 7 SAT119_1  2012-01-06 06:18:01 2012-01-07 22:19:01 49         no          6.67 
 8 SAT119_1  2012-01-20 02:19:01 2012-01-21 22:18:01 59         no         12.2  
 9 SAT119_1  2012-01-22 18:18:01 2012-01-24 14:18:01 61         no          0.833
10 SAT119_1  2012-01-26 06:18:01 2012-01-28 14:20:01 63         no          1.67