rlaglead

How can I use the lag function to skip over rows and provide number of previous occurrences


I have a dataframe with the first three columns 'Person', 'NoShow', and 'date'. I want to create the two additional columns 'prior_noshow' and 'prior_noshow_f', but I cannot figure out how to do this using the lag function.

   Person NoShow       date prior_noshow prior_noshow_f
1     123      0 2019-01-01           NA             NA
2     123      1 2019-04-01            0              0
3     123      0 2020-01-01            1              1
4     123      1 2021-01-01            1              1
5     123      0 2021-04-30            1              2
6     123      0 2022-01-05            1              2
7     334      0 2019-07-07           NA             NA
8     334      1 2019-10-11            0              0
9     334      1 2020-07-07            1              1
10    334      0 2021-01-04            1              2

Column 'prior_noshow' should show whether a person missed a previous appointment. So, person 123 had their first noshow on 2019-04-01, and I want every following row for person 123 to have prior_noshow==1.

Column 'prior_noshow_f" should show the frequency of previously missed appointments. So, person 123 should have prior_noshow_f==1 for dates 2020-01-01 and 2021-01-01 and prior_noshow_f==2 for dates 2021-04-30 and 2022-01-05.

Code to create the dataset:

df <- data.frame(Person = c('123','123','123','123','123','123','334','334','334','334'), 
                   NoShow = c(0,1,0,1,0,0,0,1,1,0), 
                   date = c('2019-01-01','2019-04-01','2020-01-01','2021-01-01','2021-04-30','2022-01-05','2019-07-07','2019-10-11','2020-07-07','2021-01-04')
                   )

I've tried different approaches I saw in previous discussions here regarding the lag-function, but I haven't able to figure this out.

Any help would be greatly appreciated!


Solution

  • You need to lag the cumsum of NoShow to get prior_noshow_f, and prior_noshow is just as.numeric(prior_noshow_f > 0), so you could do:

    library(tidyverse)
    
    df %>% 
      group_by(Person) %>%
      mutate(prior_noshow = lag(as.numeric(cumsum(NoShow) > 0)),
             prior_noshow_f = lag(cumsum(NoShow)))
    #> # A tibble: 10 x 5
    #> # Groups:   Person [2]
    #>    Person NoShow date       prior_noshow prior_noshow_f
    #>     <int>  <int> <chr>             <dbl>          <int>
    #>  1    123      0 2019-01-01           NA             NA
    #>  2    123      1 2019-04-01            0              0
    #>  3    123      0 2020-01-01            1              1
    #>  4    123      1 2021-01-01            1              1
    #>  5    123      0 2021-04-30            1              2
    #>  6    123      0 2022-01-05            1              2
    #>  7    334      0 2019-07-07           NA             NA
    #>  8    334      1 2019-10-11            0              0
    #>  9    334      1 2020-07-07            1              1
    #> 10    334      0 2021-01-04            1              2
    

    Created on 2022-08-22 with reprex v2.0.2