dplyrdifftime

A way to combine start and stop date if there is overlap or if there is less than a 5 day gap


I am dealing with individuals who received a particular treatment and I want to know the time ranges they were treated, or within 5 days of ending a treatment and starting another.

r <- read.table(text="
   ID   Start_Date     End_Date
    1     05-06-18       05-10-18
    1     05-08-18       05-14-18  
    1     05-16-18       05-25-18
    1     06-28-19       07-02-19
    1     07-03-19       07-08-19
    2     04-20-18       04-23-18
    2     07-20/18       07-25-18 
    2     07-26-18       07-30-18 
    3     05-14-17       05-29-17", 
                stringsAsFactors=FALSE, header=TRUE)

This is what I had, and this is what I want:

r <- read.table(text="
   ID   Start_Date     End_Date
    1     05-06-18     05-25-18
    1     06-28-19     07-08-19
    2     04-20-18     07-30-18 
    3     05-14-17     05-29-17", 
                stringsAsFactors=FALSE, header=TRUE)

This is the code I have tried, but doesn't quite get what I am after.


Solution

  • Based on your data, I cannot determine how to achieve your example output based on your description. That's because the dates for ID == 2 have a gap > 5 days.

    However, applying the logic that creates your desired outcome for the other ID values, I have assumed that a treatment period is grouped if the gap between the end of previous treatment continues to be <= 5 days. For example, if a sequence of five treatments are 1, 3, 5, 4, and 5 days apart, this 18 day period should be grouped as one row.

    The workflow:

    The result returns two rows for ID == 2, which I believe is consistent with your explanation. If not, please update your question.

    library(lubridate)
    library(dplyr)
    
    r <- read.table(text = "
       ID   Start_Date     End_Date
        1     05-06-18       05-10-18
        1     05-08-18       05-14-18  
        1     05-16-18       05-25-18
        1     06-28-19       07-02-19
        1     07-03-19       07-08-19
        2     04-20-18       04-23-18
        2     07-20-18       07-25-18 
        2     07-26-18       07-30-18 
        3     05-14-17       05-29-17",
                    stringsAsFactors = FALSE, header = TRUE)
    
    # Convert date strings to date class columns
    r$Start_Date <- mdy(r$Start_Date)
    r$End_Date <- mdy(r$End_Date)
    
    # Determine start and end dates for treatment periods where treatments are 
    # grouped if gap between end of previous treatment continues to be <= 5 days
    r |>
      group_by(ID) |>
      mutate(tmp = as.integer(Start_Date - lag(End_Date, default = NA)),
             tmp = if_else(is.na(tmp) | tmp > 5, 0 , tmp),
             ID1 = cumsum(tmp > 5 | tmp == 0)) |>
      ungroup() |>
      summarise(Start_Date = min(Start_Date),
                End_Date = max(End_Date), .by = c(ID, ID1)) |>
      select(-ID1)
    
    # # A tibble: 5 × 3
    #      ID Start_Date End_Date  
    #   <int> <date>     <date>    
    # 1     1 2018-05-06 2018-05-25
    # 2     1 2019-06-28 2019-07-08
    # 3     2 2018-04-20 2018-04-23
    # 4     2 2018-07-20 2018-07-30
    # 5     3 2017-05-14 2017-05-29