rdplyrdifftime

In R, is there a way to calculate the number of days between the end of one event and the beginning of another?


(Note: Hotel data used as illustrative example.)

I'm working with a dataset that has multiple records (e.g. 'hotelStays') for each customer (custID). My goal is to get the number of days since a customer's previous stay as a new column in the data frame (i.e. each customer's first stay would have 'NA' as its value). To do so, I want to subtract each customer's previous checkOutDt from their current checkInDt. However, when I attempt to do so using lag(), all values in the new column are 'NA'.

Below is an example of the type of data with which I'm working.

custID stayID stayDt checkInDt checkOutDt
AAAAA 11111 01/15/1995 01/10/1995 01/17/1995
BBBBB 11112 02/08/1995 02/02/1995 02/25/1995
AAAAA 11113 03/01/1995 03/01/1995 03/03/1995
AAAAA 11114 06/24/1995 06/22/1995 07/02/1995
BBBBB 11115 10/02/1995 10/01/1995 10/10/1995
CCCCC 11116 01/08/1996 01/05/1996 01/17/1996
AAAAA 11117 05/15/1996 05/10/1996 05/28/1996

Ideally, the new column, 'daysSinceLastStay', would have the following values:

daysSinceLastStay
NA
NA
43
111
218
NA
313

However, I think I need to sort by custID and stayDt first.


Below is my current attempt at the code:

hotelData <- hotelData %>%
                arrange(custID, stayDt) %>%
                mutate(daysSinceLastStay = 
                       checkInDt - lag(checkOutDt))

Any advice is greatly appreciated!


Solution

  • Based on the data you're expecting, it seems like you're needing to make use of the group_by() function. This should get you what you're looking for.

    # t*r*ibble, for creating data by row
    hotelData <- tibble::tribble(
      ~custID, ~stayID, ~stayDt, ~checkInDt, ~checkOutDt,
      "AAAAA",  11111,  "01/15/1995",   "01/10/1995",   "01/17/1995",
      "BBBBB",  11112,  "02/08/1995",   "02/02/1995",   "02/25/1995",
      "AAAAA",  11113,  "03/01/1995",   "03/01/1995",   "03/03/1995",
      "AAAAA",  11114,  "06/24/1995",   "06/22/1995",   "07/02/1995",
      "BBBBB",  11115,  "10/02/1995",   "10/01/1995",   "10/10/1995",
      "CCCCC",  11116,  "01/08/1996",   "01/05/1996",   "01/17/1996",
      "AAAAA",  11117,  "05/15/1996",   "05/10/1996",   "05/28/1996"
    )
    
    # convert the date columns to the proper data type
    # then, sort the data by customer ID and stayID
    hotelData <- hotelData %>%
      mutate(across(stayDt:checkOutDt, lubridate::mdy)) %>%
      arrange(custID, stayID)
    
    # within each customer, take the difference in days
    hotelData %>%
      group_by(custID) %>%
      mutate(daysSinceLastStay = as.numeric(checkInDt - lag(checkOutDt)))
    
    # A tibble: 7 x 6
    # Groups:   custID [3]
      custID stayID stayDt     checkInDt  checkOutDt daysSinceLastStay
      <chr>   <dbl> <date>     <date>     <date>                 <dbl>
    1 AAAAA   11111 1995-01-15 1995-01-10 1995-01-17                NA
    2 AAAAA   11113 1995-03-01 1995-03-01 1995-03-03                43
    3 AAAAA   11114 1995-06-24 1995-06-22 1995-07-02               111
    4 AAAAA   11117 1996-05-15 1996-05-10 1996-05-28               313
    5 BBBBB   11112 1995-02-08 1995-02-02 1995-02-25                NA
    6 BBBBB   11115 1995-10-02 1995-10-01 1995-10-10               218
    7 CCCCC   11116 1996-01-08 1996-01-05 1996-01-17                NA