rdatelaglead

How can I use the lag function to get the date of the most recent occurrence?


I have a dataframe with columns 'person', 'NoShow', and 'date'.

   Person NoShow       date
1     123      0 2019-01-01
2     123      1 2019-04-01
3     123      0 2020-01-01
4     123      1 2021-01-01
5     123      0 2021-04-30
6     123      0 2022-01-05
7     334      0 2019-07-07
8     334      1 2019-10-11
9     334      1 2020-07-07
10    334      0 2021-01-04

Code to create the dataframe:

df <- data.frame(Person = c('123','123','123','123','123','123','334','334','334','334'), 
                   NoShow = c(0,1,0,1,0,0,0,1,1,0), 
                   date = c('2019-01-01','2019-04-01','2020-01-01','2021-01-01','2021-04-30','2022-01-05','2019-07-07','2019-10-11','2020-07-07','2021-01-04')
                   )

I used the lag function to create two new columns: (1) 'prior_noshow', which indicates whether a person missed any previous appointments; and (2) 'prior_noshow_f', which indicates the number of previous appointments a person missed.

This is the code I used to create these two additional columns:

library(tidyverse)

df %>% 
  group_by(Person) %>%
  mutate(prior_noshow = lag(as.numeric(cumsum(NoShow) > 0)),
         prior_noshow_f = lag(cumsum(NoShow)))

This is the dataframe with the two new columns:

   Person NoShow date       prior_noshow prior_noshow_f
   <chr>   <dbl> <chr>             <dbl>          <dbl>
 1 123         0 2019-01-01           NA             NA
 2 123         1 2019-04-01            0              0
 3 123         0 2020-01-01            1              1
 4 123         1 2021-01-01            1              1
 5 123         0 2021-04-30            1              2
 6 123         0 2022-01-05            1              2
 7 334         0 2019-07-07           NA             NA
 8 334         1 2019-10-11            0              0
 9 334         1 2020-07-07            1              1
10 334         0 2021-01-04            1              2

Now, I'd also like to add a column 'prior_noshow_date', which should show the date of the most recent no-show per person.

So, for example, rows 3 and 4 should have prior_noshow_date==2019-04-01 and rows 5 and 6 should have prior_noshow_date==2021-01-01.


Solution

  • Using data.table, we can filter for each case.

    library(data.table)
    setDT(df)
    df$PrevMiss <- sapply(
      1:dim(df)[1], function(x) df[1:x, ][
        NoShow == 1 & Person == last(Person) & date != last(date) , max(date)])
    

    Output:

    > df
        Person NoShow       date   PrevMiss
     1:    123      0 2019-01-01       <NA>
     2:    123      1 2019-04-01       <NA>
     3:    123      0 2020-01-01 2019-04-01
     4:    123      1 2021-01-01 2019-04-01
     5:    123      0 2021-04-30 2021-01-01
     6:    123      0 2022-01-05 2021-01-01
     7:    334      0 2019-07-07       <NA>
     8:    334      1 2019-10-11       <NA>
     9:    334      1 2020-07-07 2019-10-11
    10:    334      0 2021-01-04 2020-07-07