rdplyrtidyverse

How to regroup lines and values when there is no gap between dates?


I have a set of data representing a patient's hospital stays.

Some of these stays may contain an event (Boolean column). Starting with an "event" stay, the aim is to group together all the following stays if they are consecutive in time, in other words, if there is no day between the end of one and the start of the next. For these grouped stays, the costs of each stay is also added together to obtain an overall cost.

Here is an exemple:

Input:

enter image description here

Expected output:

enter image description here

Here some code with those example data:

library(tidyverse)
library(lubridate)

input <- tribble(
 ~ begin          ,  ~ end            , ~ event,  ~ cost,
 ymd("2022/11/21"),  ymd("2023/11/30"), FALSE   ,      6,
 ymd("2023/01/01"),  ymd("2023/01/03"), TRUE   ,       8,
 ymd("2023/01/07"),  ymd("2023/01/10"), FALSE  ,       2,
 ymd("2023/01/14"),  ymd("2023/01/19"), TRUE   ,       3,
 ymd("2023/01/19"),  ymd("2023/01/25"), FALSE  ,       7,
 ymd("2023/02/14"),  ymd("2023/03/01"), FALSE  ,       5,
 ymd("2023/04/03"),  ymd("2023/04/11"), TRUE   ,       5,
 ymd("2023/04/11"),  ymd("2023/04/21"), FALSE  ,       2,
 ymd("2023/04/21"),  ymd("2023/04/23"), FALSE  ,       4,
 ymd("2023/05/01"),  ymd("2023/06/22"), FALSE  ,      15,
) 

expected_ouput <- tribble(
 ~ begin          ,  ~ end            , ~ cost,
 ymd("2023/01/01"),  ymd("2023/01/03"),      8,
 ymd("2023/01/14"),  ymd("2023/01/25"),     10,
 ymd("2023/04/03"),  ymd("2023/04/23"),     11,
)

I have made up some boring solution - with nested while loops - that does not look like tidyverse or R style to me. Please enlighten me!


Solution

  • You may use the following -

    library(dplyr)
    
    input %>%
      mutate(diff_days = as.integer(begin - lag(end)), 
             group = cumsum(event)) %>%
      filter(event | diff_days == 0) %>%
      summarise(begin = min(begin), 
                end = max(end), 
                cost = sum(cost), .by = group) %>%
      select(-group)
    
    # A tibble: 3 × 3
    #  begin      end         cost
    #  <date>     <date>     <dbl>
    #1 2023-01-01 2023-01-03     8
    #2 2023-01-14 2023-01-25    10
    #3 2023-04-03 2023-04-23    11
    

    We create diff_days column to check if the days are consecutive, keep only those rows where event is TRUE or days are consecutive. I have also created a group column to identify clear group of days based on event variable. For each group we then sum the cost and take min and max values of begin and end dates respectively.