I have a dataset where each row is a cluster, and I want to know which clusters fall within the previous cluster, or 12h after the previous cluster ended.
My dataset looks like this:
cluster_id period_id clus_start clus_end
36 SAT119_1 2011-12-08 14:19 2011-12-10 14:18
37 SAT119_1 2011-12-08 18:18 2011-12-10 06:20
40 SAT119_1 2011-12-14 22:19 2011-12-16 22:18
41 SAT119_1 2011-12-17 22:18 2011-12-18 18:18
42 SAT119_1 2011-12-19 05:19 2011-12-30 14:18
47 SAT119_1 2011-12-30 18:19 2012-01-04 22:19
49 SAT119_1 2012-01-06 06:18 2012-01-07 22:19
59 SAT119_1 2012-01-20 02:19 2012-01-21 22:18
61 SAT119_1 2012-01-22 18:18 2012-01-24 14:18
63 SAT119_1 2012-01-26 06:18 2012-01-28 14:20
What I want is to check, sequentially, if the start of cluster n+1
falls within the end (or 12h later) of cluster n
, and I want to create a new column with a yes/no option. The first cluster of each period id would always have "no", and then it would be either "yes" or "no".
But, at the same time, IF cluster n+1
falls within cluster n
, I would like cluster n+2
to be checked in relation to cluster n
.
In order words, if overlap (see table below) is yes, then that cluster should be ignored to calculate the time difference for the next cluster.
So, the output in this table would be:
cluster_id period_id clus_start clus_end overlap time_diff (days)
36 SAT119_1 2011-12-08 14:19 2011-12-10 14:18 no NA
37 SAT119_1 2011-12-08 18:18 2011-12-10 06:20 yes start_37 - end_36
40 SAT119_1 2011-12-14 22:19 2011-12-16 22:18 no start_40 - end_36
41 SAT119_1 2011-12-17 22:18 2011-12-18 18:18 no start_41 - end_40
42 SAT119_1 2011-12-19 05:19 2011-12-30 14:18 yes start_42 - end_41
47 SAT119_1 2011-12-30 18:19 2012-01-04 22:19 no start_47 - end_41
49 SAT119_1 2012-01-06 06:18 2012-01-07 22:19 no start_49 - end_47
59 SAT119_1 2012-01-20 02:19 2012-01-21 22:18 no start_59 - end_49
61 SAT119_1 2012-01-22 18:18 2012-01-24 14:18 no start_61 - end_59
63 SAT119_1 2012-01-26 06:18 2012-01-28 14:20 no start_63 - end_61
(Where start_37 and so on indicates the "clus_start" of cluster id 37)
I believe that, at first, the time has to be calculated for the first two rows, then the value in the column "within" is filled, and then the next calculations will be made considering the value (yes/no) on that column.
But I am a bit stuck, so far I could only apply the difftime()
function to get the times between the clusters:
df$time_diff <- as.numeric(with(df, difftime(clus_start, ave(clus_end, period_id,FUN=lag), units='days')))
But this does not account for the yes/no column... Any idea how to incorporate this step?
Here is an example of the dataset:
structure(list(clus_start = structure(c(1323353941, 1323368281,
1323901141, 1324160281, 1324275541, 1325269141, 1325830681, 1327025941,
1327256281, 1327558681), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
clus_end = structure(c(1323526681, 1323498001, 1324073881,
1324232281, 1325254681, 1325715541, 1325974741, 1327184281,
1327414681, 1327760401), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
cluster_id = c("36", "37", "40",
"41", "42", "47", "49", "59",
"61", "63"), period_id = c("SAT119_1", "SAT119_1",
"SAT119_1", "SAT119_1", "SAT119_1", "SAT119_1", "SAT119_1",
"SAT119_1", "SAT119_1", "SAT119_1")), row.names = c(2L, 3L,
4L, 5L, 6L, 8L, 9L, 13L, 14L, 15L), class = "data.frame")
You can produce an output matching your desired format, where:
df <- structure(list(clus_start = structure(c(1323353941, 1323368281,
1323901141, 1324160281, 1324275541, 1325269141, 1325830681, 1327025941,
1327256281, 1327558681), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
clus_end = structure(c(1323526681, 1323498001, 1324073881,
1324232281, 1325254681, 1325715541, 1325974741, 1327184281,
1327414681, 1327760401), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
cluster_id = c("36", "37", "40",
"41", "42", "47", "49", "59",
"61", "63"), period_id = c("SAT119_1", "SAT119_1",
"SAT119_1", "SAT119_1", "SAT119_1", "SAT119_1", "SAT119_1",
"SAT119_1", "SAT119_1", "SAT119_1")), row.names = c(2L, 3L,
4L, 5L, 6L, 8L, 9L, 13L, 14L, 15L), class = "data.frame")#
analyze_clusters <- function(df) {
# Convert time columns to POSIXct if they aren't already
df$clus_start <- as.POSIXct(df$clus_start, tz = "GMT")
df$clus_end <- as.POSIXct(df$clus_end, tz = "GMT")
# Initialize result columns
df$overlap <- "no"
df$time_diff <- NA
# Process each period_id separately
result <- df %>%
group_by(period_id) %>%
group_modify(~{
dat <- .x
n <- nrow(dat)
# Reference cluster index (the one we're comparing against)
ref_idx <- 1
for(i in 2:n) {
# Add 12 hours to the reference end time
ref_end_plus_12h <- dat$clus_end[ref_idx] + 12 * 3600
print(paste0("Ref: ",ref_end_plus_12h, " for i: ", i))
# Check if current cluster starts within reference cluster (including 12h buffer)
if(dat$clus_start[i] <= ref_end_plus_12h) {
print(paste0(dat$clus_start[i], " <= ", ref_end_plus_12h, "!"))
print("Overlap-> yes")
dat$overlap[i] <- "yes"
# Calculate time difference from reference cluster
dat$time_diff[i] <- as.numeric(
difftime(dat$clus_start[i], dat$clus_end[ref_idx], units = "days")
)
} else {
print("Overlap-> no")
# If no overlap, calculate time difference and update reference
dat$time_diff[i] <- as.numeric(
difftime(dat$clus_start[i], dat$clus_end[ref_idx], units = "days")
)
ref_idx <- i
print(paste0("New ref index: ", i))
}
}
dat
}) %>%
ungroup()
return(result)
}
res <- analyze_clusters(df)
But then my result looks different from yours:
period_id clus_start clus_end cluster_id overlap time_diff
<chr> <dttm> <dttm> <chr> <chr> <dbl>
1 SAT119_1 2011-12-08 14:19:01 2011-12-10 14:18:01 36 no NA
2 SAT119_1 2011-12-08 18:18:01 2011-12-10 06:20:01 37 yes -1.83
3 SAT119_1 2011-12-14 22:19:01 2011-12-16 22:18:01 40 no 4.33
4 SAT119_1 2011-12-17 22:18:01 2011-12-18 18:18:01 41 no 1
5 SAT119_1 2011-12-19 06:19:01 2011-12-30 14:18:01 42 no 0.501
6 SAT119_1 2011-12-30 18:19:01 2012-01-04 22:19:01 47 yes 0.167
7 SAT119_1 2012-01-06 06:18:01 2012-01-07 22:19:01 49 no 6.67
8 SAT119_1 2012-01-20 02:19:01 2012-01-21 22:18:01 59 no 12.2
9 SAT119_1 2012-01-22 18:18:01 2012-01-24 14:18:01 61 no 0.833
10 SAT119_1 2012-01-26 06:18:01 2012-01-28 14:20:01 63 no 1.67