(Note: Hotel data used as illustrative example.)
I'm working with a dataset that has multiple records (e.g. 'hotelStays') for each customer (custID). My goal is to get the number of days since a customer's previous stay as a new column in the data frame (i.e. each customer's first stay would have 'NA' as its value). To do so, I want to subtract each customer's previous checkOutDt from their current checkInDt. However, when I attempt to do so using lag(), all values in the new column are 'NA'.
Below is an example of the type of data with which I'm working.
custID | stayID | stayDt | checkInDt | checkOutDt |
---|---|---|---|---|
AAAAA | 11111 | 01/15/1995 | 01/10/1995 | 01/17/1995 |
BBBBB | 11112 | 02/08/1995 | 02/02/1995 | 02/25/1995 |
AAAAA | 11113 | 03/01/1995 | 03/01/1995 | 03/03/1995 |
AAAAA | 11114 | 06/24/1995 | 06/22/1995 | 07/02/1995 |
BBBBB | 11115 | 10/02/1995 | 10/01/1995 | 10/10/1995 |
CCCCC | 11116 | 01/08/1996 | 01/05/1996 | 01/17/1996 |
AAAAA | 11117 | 05/15/1996 | 05/10/1996 | 05/28/1996 |
Ideally, the new column, 'daysSinceLastStay', would have the following values:
daysSinceLastStay |
---|
NA |
NA |
43 |
111 |
218 |
NA |
313 |
However, I think I need to sort by custID and stayDt first.
Below is my current attempt at the code:
hotelData <- hotelData %>%
arrange(custID, stayDt) %>%
mutate(daysSinceLastStay =
checkInDt - lag(checkOutDt))
Any advice is greatly appreciated!
Based on the data you're expecting, it seems like you're needing to make use of the group_by()
function. This should get you what you're looking for.
# t*r*ibble, for creating data by row
hotelData <- tibble::tribble(
~custID, ~stayID, ~stayDt, ~checkInDt, ~checkOutDt,
"AAAAA", 11111, "01/15/1995", "01/10/1995", "01/17/1995",
"BBBBB", 11112, "02/08/1995", "02/02/1995", "02/25/1995",
"AAAAA", 11113, "03/01/1995", "03/01/1995", "03/03/1995",
"AAAAA", 11114, "06/24/1995", "06/22/1995", "07/02/1995",
"BBBBB", 11115, "10/02/1995", "10/01/1995", "10/10/1995",
"CCCCC", 11116, "01/08/1996", "01/05/1996", "01/17/1996",
"AAAAA", 11117, "05/15/1996", "05/10/1996", "05/28/1996"
)
# convert the date columns to the proper data type
# then, sort the data by customer ID and stayID
hotelData <- hotelData %>%
mutate(across(stayDt:checkOutDt, lubridate::mdy)) %>%
arrange(custID, stayID)
# within each customer, take the difference in days
hotelData %>%
group_by(custID) %>%
mutate(daysSinceLastStay = as.numeric(checkInDt - lag(checkOutDt)))
# A tibble: 7 x 6
# Groups: custID [3]
custID stayID stayDt checkInDt checkOutDt daysSinceLastStay
<chr> <dbl> <date> <date> <date> <dbl>
1 AAAAA 11111 1995-01-15 1995-01-10 1995-01-17 NA
2 AAAAA 11113 1995-03-01 1995-03-01 1995-03-03 43
3 AAAAA 11114 1995-06-24 1995-06-22 1995-07-02 111
4 AAAAA 11117 1996-05-15 1996-05-10 1996-05-28 313
5 BBBBB 11112 1995-02-08 1995-02-02 1995-02-25 NA
6 BBBBB 11115 1995-10-02 1995-10-01 1995-10-10 218
7 CCCCC 11116 1996-01-08 1996-01-05 1996-01-17 NA