r

Extract rows for the first week of data that meets criteria in R


I have longitudinal lab data where each person in a dataset has a line for every day they measured their A1C and uploaded it via an app. I need help identifying and extracting the first 'valid' week of data where the person uploaded their measurement on least 5 of 7 days in a given week. However, if they had data for 6 days or 7, I want to extract all possible measures for that week.

An example of the data would be this:

  person_id       date a1c
    1000492 2019-09-11 5.5
    1000492 2019-09-22 5.7
    1000492 2019-09-23 5.5
    1000492 2019-09-24 5.6
    1000492 2019-10-15 5.5
    1000492 2019-10-16 5.6
    1000492 2019-10-18 5.5
    1000492 2019-10-19 5.7
    1000492 2019-10-20 5.5
    1000492 2019-11-14 5.6
    1000492 2019-11-15 5.5 
    1000492 2019-12-16 5.6
    1000499 2019-09-11 5.5
    1000499 2019-09-12 5.7
    1000499 2019-09-26 5.5
    1000499 2019-09-27 5.6
    1000499 2019-09-28 5.5
    1000499 2019-09-29 5.7
    1000499 2019-09-30 5.5
    1000499 2019-10-01 5.6
    1000499 2019-10-02 5.5

So ideally, I would get a resulting dataframe of something like this:

  person_id       date a1c
    1000492 2019-10-15 5.5
    1000492 2019-10-16 5.6
    1000492 2019-10-18 5.5
    1000492 2019-10-19 5.7
    1000492 2019-10-20 5.5
    1000499 2019-09-26 5.5
    1000499 2019-09-27 5.6
    1000499 2019-09-28 5.5
    1000499 2019-09-29 5.7
    1000499 2019-09-30 5.5
    1000499 2019-10-01 5.6
    1000499 2019-10-02 5.5

Solution

  • To get your required result of measurements which are within in the first valid 5-7 day time windows, do this:

    library(dplyr)
    library(lubridate)
    
    df <- data.frame(
      person_id = c(rep(1000492, 12), rep(1000499, 9)),
      date = as.Date(c( "2019-09-11", "2019-09-22", "2019-09-23", "2019-09-24","2019-10-15", "2019-10-16", "2019-10-18", "2019-10-19", "2019-10-20","2019-11-14", "2019-11-15", "2019-12-16","2019-09-11", "2019-09-12", "2019-09-26", "2019-09-27", "2019-09-28","2019-09-29", "2019-09-30", "2019-10-01", "2019-10-02")),
      a1c = c(5.5, 5.7, 5.5, 5.6, 5.5, 5.6, 5.5, 5.7, 5.5, 5.6, 5.5, 5.6, 5.5, 5.7, 5.5, 5.6, 5.5, 5.7, 5.5, 5.6, 5.5
      )
    )    
    
    res <- df %>%
      arrange(person_id, date) %>%
      group_by(person_id) %>%
      mutate(
        window_start = date,
        window_end = date + days(6),
        measurements_in_window = sapply(date, function(ws) sum(date >= ws & date <= ws + days(6)))
      ) %>%
      filter(measurements_in_window >= 5) %>%
      filter(date == min(date)) %>%
      inner_join(df, by = "person_id") %>%
      filter(date.y >= window_start & date.y <= window_end) %>% mutate(date = date.y, a1c = a1c.y) %>% select(person_id, date, a1c)
    

    Output:

    > res
    # A tibble: 12 × 3
       person_id date         a1c
           <dbl> <date>     <dbl>
     1   1000492 2019-10-15   5.5
     2   1000492 2019-10-16   5.6
     3   1000492 2019-10-18   5.5
     4   1000492 2019-10-19   5.7
     5   1000492 2019-10-20   5.5
     6   1000499 2019-09-26   5.5
     7   1000499 2019-09-27   5.6
     8   1000499 2019-09-28   5.5
     9   1000499 2019-09-29   5.7
    10   1000499 2019-09-30   5.5
    11   1000499 2019-10-01   5.6
    12   1000499 2019-10-02   5.5