A way to combine start and stop date if there is overlap or if there is less than a 5 day gap

I am dealing with individuals who received a particular treatment and I want to know the time ranges they were treated, or within 5 days of ending a treatment and starting another.

r <- read.table(text="
   ID   Start_Date     End_Date
    1     05-06-18       05-10-18
    1     05-08-18       05-14-18  
    1     05-16-18       05-25-18
    1     06-28-19       07-02-19
    1     07-03-19       07-08-19
    2     04-20-18       04-23-18
    2     07-20/18       07-25-18 
    2     07-26-18       07-30-18 
    3     05-14-17       05-29-17", 
                stringsAsFactors=FALSE, header=TRUE)

This is what I had, and this is what I want:

r <- read.table(text="
   ID   Start_Date     End_Date
    1     05-06-18     05-25-18
    1     06-28-19     07-08-19
    2     04-20-18     07-30-18 
    3     05-14-17     05-29-17", 
                stringsAsFactors=FALSE, header=TRUE)

This is the code I have tried, but doesn't quite get what I am after.

Solution

Based on your data, I cannot determine how to achieve your example output based on your description. That's because the dates for ID == 2 have a gap > 5 days.

However, applying the logic that creates your desired outcome for the other ID values, I have assumed that a treatment period is grouped if the gap between the end of previous treatment continues to be <= 5 days. For example, if a sequence of five treatments are 1, 3, 5, 4, and 5 days apart, this 18 day period should be grouped as one row.

The workflow:

return the number of days between the start of one treatment and the end of the previous treatment
set the first row of every ID, and every gap > 5 days, to 0
create a column identifying the start and end of each treatment period
return the minimum Start_Date and maximum End_Date for each ID treatment period

The result returns two rows for ID == 2, which I believe is consistent with your explanation. If not, please update your question.

library(lubridate)
library(dplyr)

r <- read.table(text = "
   ID   Start_Date     End_Date
    1     05-06-18       05-10-18
    1     05-08-18       05-14-18  
    1     05-16-18       05-25-18
    1     06-28-19       07-02-19
    1     07-03-19       07-08-19
    2     04-20-18       04-23-18
    2     07-20-18       07-25-18 
    2     07-26-18       07-30-18 
    3     05-14-17       05-29-17",
                stringsAsFactors = FALSE, header = TRUE)

# Convert date strings to date class columns
r$Start_Date <- mdy(r$Start_Date)
r$End_Date <- mdy(r$End_Date)

# Determine start and end dates for treatment periods where treatments are 
# grouped if gap between end of previous treatment continues to be <= 5 days
r |>
  group_by(ID) |>
  mutate(tmp = as.integer(Start_Date - lag(End_Date, default = NA)),
         tmp = if_else(is.na(tmp) | tmp > 5, 0 , tmp),
         ID1 = cumsum(tmp > 5 | tmp == 0)) |>
  ungroup() |>
  summarise(Start_Date = min(Start_Date),
            End_Date = max(End_Date), .by = c(ID, ID1)) |>
  select(-ID1)

# # A tibble: 5 × 3
#      ID Start_Date End_Date  
#   <int> <date>     <date>    
# 1     1 2018-05-06 2018-05-25
# 2     1 2019-06-28 2019-07-08
# 3     2 2018-04-20 2018-04-23
# 4     2 2018-07-20 2018-07-30
# 5     3 2017-05-14 2017-05-29