I am dealing with individuals who received a particular treatment and I want to know the time ranges they were treated, or within 5 days of ending a treatment and starting another.
r <- read.table(text="
ID Start_Date End_Date
1 05-06-18 05-10-18
1 05-08-18 05-14-18
1 05-16-18 05-25-18
1 06-28-19 07-02-19
1 07-03-19 07-08-19
2 04-20-18 04-23-18
2 07-20/18 07-25-18
2 07-26-18 07-30-18
3 05-14-17 05-29-17",
stringsAsFactors=FALSE, header=TRUE)
This is what I had, and this is what I want:
r <- read.table(text="
ID Start_Date End_Date
1 05-06-18 05-25-18
1 06-28-19 07-08-19
2 04-20-18 07-30-18
3 05-14-17 05-29-17",
stringsAsFactors=FALSE, header=TRUE)
This is the code I have tried, but doesn't quite get what I am after.
Based on your data, I cannot determine how to achieve your example output based on your description. That's because the dates for ID == 2 have a gap > 5 days.
However, applying the logic that creates your desired outcome for the other ID values, I have assumed that a treatment period is grouped if the gap between the end of previous treatment continues to be <= 5 days. For example, if a sequence of five treatments are 1, 3, 5, 4, and 5 days apart, this 18 day period should be grouped as one row.
The workflow:
The result returns two rows for ID == 2, which I believe is consistent with your explanation. If not, please update your question.
library(lubridate)
library(dplyr)
r <- read.table(text = "
ID Start_Date End_Date
1 05-06-18 05-10-18
1 05-08-18 05-14-18
1 05-16-18 05-25-18
1 06-28-19 07-02-19
1 07-03-19 07-08-19
2 04-20-18 04-23-18
2 07-20-18 07-25-18
2 07-26-18 07-30-18
3 05-14-17 05-29-17",
stringsAsFactors = FALSE, header = TRUE)
# Convert date strings to date class columns
r$Start_Date <- mdy(r$Start_Date)
r$End_Date <- mdy(r$End_Date)
# Determine start and end dates for treatment periods where treatments are
# grouped if gap between end of previous treatment continues to be <= 5 days
r |>
group_by(ID) |>
mutate(tmp = as.integer(Start_Date - lag(End_Date, default = NA)),
tmp = if_else(is.na(tmp) | tmp > 5, 0 , tmp),
ID1 = cumsum(tmp > 5 | tmp == 0)) |>
ungroup() |>
summarise(Start_Date = min(Start_Date),
End_Date = max(End_Date), .by = c(ID, ID1)) |>
select(-ID1)
# # A tibble: 5 × 3
# ID Start_Date End_Date
# <int> <date> <date>
# 1 1 2018-05-06 2018-05-25
# 2 1 2019-06-28 2019-07-08
# 3 2 2018-04-20 2018-04-23
# 4 2 2018-07-20 2018-07-30
# 5 3 2017-05-14 2017-05-29