rdatelead

obtain maximum date from multiple leading dates


Sample data:

data <- data.frame(
  year = c(2018, 2018, 2018, 2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020),
  patient_id = c(101, 102, 103, 104, 102, 102, 106, 105, 105, 107, 108, 109, 105, 105, 201, 202, 203, 204, 205, 205, 205, 209, 208),
  discharge_date = as.Date(c("1/1/2018", "1/5/2018", "1/8/2018", "2/5/2018", "2/10/2018", "2/11/2018", "3/1/2018", "1/2/2019", "1/10/2019", "3/1/2019", "3/5/2019", "3/25/2019", "5/5/2019", "5/6/2019", "1/1/2020", "2/1/2020", "2/10/2020", "3/3/2020", "4/1/2020", "4/2/2020", "4/3/2020", "6/17/2020", "8/8/2020"), format = "%m/%d/%Y"),
  contagious_admission = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0)
)

I am trying to create the column latest_discharge_date, which takes in the discharge_date value following the following logic: If patient_id of current row is different from the leading patient_id, then it takes in the discharge_date from current row. If patient_id of current row is the same as the leading patient_id, AND in both rows contagious_admission == 0, then it takes in the discharge_date from current row. But when there is leading single or multiple consecutive rows with same patient_id, AND in the first row of such series, contagious_admission == 0, in the latter ones, contagious_admission == 1, the latest_discharge_date of the first row should take in the latest or the maximum discharge_date from the consecutive rows of contagious_admission == 1.

1st Attempt:

data |>
  mutate(
    latest_discharge_date = case_when(
      contagious_admission == 0 & lead(contagious_admission) == 0 ~ discharge_date,
      contagious_admission == 0 & lead(contagious_admission) == 1 ~ lead(discharge_date)
    , TRUE ~ NA)
  )

Everything worked fine, but if you look at patient_id = 205, latest_discharge_date of the index row (patient_id == 205 & contagious_admission == 0) is taking in "2020-04-02". But I need it to take the next leading date (belonging to the same patient_id and contagious_admission == 1 "group"), which is "2020-04-03".

2nd Attempt:

data |>
  mutate(
    latest_discharge_date = case_when(
      contagious_admission == 0 & lead(contagious_admission) == 0 ~ discharge_date,
      contagious_admission == 0 & lead(contagious_admission) == 1 ~ lead(pmax(lead(discharge_date)))
    , TRUE ~ NA)
  )

This one nails the ones with multiple leading rows of contagious_admission == 1, but overshoots the single ones.


Solution

  • One approach is to assign a count for each admission for each patient_id (this allows for patients to have more than one admission easily). Each row where the contagious_admission is zero, it is considered a new admission.

    Then, you can group based on both the patient_id and the count, and if contagious_admission is 0, then use the last discharge_date, otherwise mark as missing (as in the example).

    library(tidyverse)
    
    data |>
      group_by(patient_id) |>
      mutate(adm_count = cumsum(contagious_admission == 0)) |>
      group_by(patient_id, adm_count) |>
      mutate(latest_discharge_date = if_else(
        contagious_admission == 0,
        last(discharge_date),
        NA)) |>
      ungroup() |>
      select(-adm_count)
    

    Output

        year patient_id discharge_date contagious_admission latest_discharge_date
       <dbl>      <dbl> <date>                        <dbl> <date>               
     1  2018        101 2018-01-01                        0 2018-01-01           
     2  2018        102 2018-01-05                        0 2018-01-05           
     3  2018        103 2018-01-08                        0 2018-01-08           
     4  2018        104 2018-02-05                        0 2018-02-05           
     5  2018        102 2018-02-10                        0 2018-02-11           
     6  2018        102 2018-02-11                        1 NA                   
     7  2018        106 2018-03-01                        0 2018-03-01           
     8  2019        105 2019-01-02                        0 2019-01-02           
     9  2019        105 2019-01-10                        0 2019-01-10           
    10  2019        107 2019-03-01                        0 2019-03-01           
    11  2019        108 2019-03-05                        0 2019-03-05           
    12  2019        109 2019-03-25                        0 2019-03-25           
    13  2019        105 2019-05-05                        0 2019-05-06           
    14  2019        105 2019-05-06                        1 NA                   
    15  2020        201 2020-01-01                        0 2020-01-01           
    16  2020        202 2020-02-01                        0 2020-02-01           
    17  2020        203 2020-02-10                        0 2020-02-10           
    18  2020        204 2020-03-03                        0 2020-03-03           
    19  2020        205 2020-04-01                        0 2020-04-03           
    20  2020        205 2020-04-02                        1 NA                   
    21  2020        205 2020-04-03                        1 NA                   
    22  2020        209 2020-06-17                        0 2020-06-17           
    23  2020        208 2020-08-08                        0 2020-08-08