Sample data:
data <- data.frame(
year = c(2018, 2018, 2018, 2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020),
patient_id = c(101, 102, 103, 104, 102, 102, 106, 105, 105, 107, 108, 109, 105, 105, 201, 202, 203, 204, 205, 205, 205, 209, 208),
discharge_date = as.Date(c("1/1/2018", "1/5/2018", "1/8/2018", "2/5/2018", "2/10/2018", "2/11/2018", "3/1/2018", "1/2/2019", "1/10/2019", "3/1/2019", "3/5/2019", "3/25/2019", "5/5/2019", "5/6/2019", "1/1/2020", "2/1/2020", "2/10/2020", "3/3/2020", "4/1/2020", "4/2/2020", "4/3/2020", "6/17/2020", "8/8/2020"), format = "%m/%d/%Y"),
contagious_admission = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0)
)
I am trying to create the column latest_discharge_date
, which takes in the discharge_date
value following the following logic:
If patient_id
of current row is different from the leading patient_id
, then it takes in the discharge_date
from current row.
If patient_id of current row is the same as the leading patient_id
, AND in both rows contagious_admission == 0
, then it takes in the discharge_date
from current row.
But when there is leading single or multiple consecutive rows with same patient_id
, AND in the first row of such series, contagious_admission == 0
, in the latter ones, contagious_admission == 1
, the latest_discharge_date
of the first row should take in the latest or the maximum discharge_date
from the consecutive rows of contagious_admission == 1
.
1st Attempt:
data |>
mutate(
latest_discharge_date = case_when(
contagious_admission == 0 & lead(contagious_admission) == 0 ~ discharge_date,
contagious_admission == 0 & lead(contagious_admission) == 1 ~ lead(discharge_date)
, TRUE ~ NA)
)
Everything worked fine, but if you look at patient_id = 205
, latest_discharge_date
of the index row (patient_id == 205 & contagious_admission == 0
) is taking in "2020-04-02"
. But I need it to take the next leading date (belonging to the same patient_id
and contagious_admission == 1
"group"), which is "2020-04-03"
.
2nd Attempt:
data |>
mutate(
latest_discharge_date = case_when(
contagious_admission == 0 & lead(contagious_admission) == 0 ~ discharge_date,
contagious_admission == 0 & lead(contagious_admission) == 1 ~ lead(pmax(lead(discharge_date)))
, TRUE ~ NA)
)
This one nails the ones with multiple leading rows of contagious_admission == 1
, but overshoots the single ones.
One approach is to assign a count for each admission for each patient_id
(this allows for patients to have more than one admission easily). Each row where the contagious_admission
is zero, it is considered a new admission.
Then, you can group based on both the patient_id
and the count, and if contagious_admission
is 0, then use the last discharge_date
, otherwise mark as missing (as in the example).
library(tidyverse)
data |>
group_by(patient_id) |>
mutate(adm_count = cumsum(contagious_admission == 0)) |>
group_by(patient_id, adm_count) |>
mutate(latest_discharge_date = if_else(
contagious_admission == 0,
last(discharge_date),
NA)) |>
ungroup() |>
select(-adm_count)
Output
year patient_id discharge_date contagious_admission latest_discharge_date
<dbl> <dbl> <date> <dbl> <date>
1 2018 101 2018-01-01 0 2018-01-01
2 2018 102 2018-01-05 0 2018-01-05
3 2018 103 2018-01-08 0 2018-01-08
4 2018 104 2018-02-05 0 2018-02-05
5 2018 102 2018-02-10 0 2018-02-11
6 2018 102 2018-02-11 1 NA
7 2018 106 2018-03-01 0 2018-03-01
8 2019 105 2019-01-02 0 2019-01-02
9 2019 105 2019-01-10 0 2019-01-10
10 2019 107 2019-03-01 0 2019-03-01
11 2019 108 2019-03-05 0 2019-03-05
12 2019 109 2019-03-25 0 2019-03-25
13 2019 105 2019-05-05 0 2019-05-06
14 2019 105 2019-05-06 1 NA
15 2020 201 2020-01-01 0 2020-01-01
16 2020 202 2020-02-01 0 2020-02-01
17 2020 203 2020-02-10 0 2020-02-10
18 2020 204 2020-03-03 0 2020-03-03
19 2020 205 2020-04-01 0 2020-04-03
20 2020 205 2020-04-02 1 NA
21 2020 205 2020-04-03 1 NA
22 2020 209 2020-06-17 0 2020-06-17
23 2020 208 2020-08-08 0 2020-08-08