I have longitudinal lab data in which each person has one row for every day they measured their A1C and uploaded it via an app. I need help identifying and extracting the first 'valid' week of data, meaning the first week in which the person uploaded a measurement on at least 5 of 7 days. If they have data for 6 or 7 days in that week, I want to extract all of the measurements for that week.
An example of the data would be this:
person_id date a1c
1000492 2019-09-11 5.5
1000492 2019-09-22 5.7
1000492 2019-09-23 5.5
1000492 2019-09-24 5.6
1000492 2019-10-15 5.5
1000492 2019-10-16 5.6
1000492 2019-10-18 5.5
1000492 2019-10-19 5.7
1000492 2019-10-20 5.5
1000492 2019-11-14 5.6
1000492 2019-11-15 5.5
1000492 2019-12-16 5.6
1000499 2019-09-11 5.5
1000499 2019-09-12 5.7
1000499 2019-09-26 5.5
1000499 2019-09-27 5.6
1000499 2019-09-28 5.5
1000499 2019-09-29 5.7
1000499 2019-09-30 5.5
1000499 2019-10-01 5.6
1000499 2019-10-02 5.5
So ideally, I would get a resulting dataframe of something like this:
person_id date a1c
1000492 2019-10-15 5.5
1000492 2019-10-16 5.6
1000492 2019-10-18 5.5
1000492 2019-10-19 5.7
1000492 2019-10-20 5.5
1000499 2019-09-26 5.5
1000499 2019-09-27 5.6
1000499 2019-09-28 5.5
1000499 2019-09-29 5.7
1000499 2019-09-30 5.5
1000499 2019-10-01 5.6
1000499 2019-10-02 5.5
To get the measurements that fall within each person's first valid 7-day window (a window that starts on a measurement date and contains at least 5 measurements), do this:
library(dplyr)
library(lubridate)
df <- data.frame(
  person_id = c(rep(1000492, 12), rep(1000499, 9)),
  date = as.Date(c(
    "2019-09-11", "2019-09-22", "2019-09-23", "2019-09-24", "2019-10-15", "2019-10-16",
    "2019-10-18", "2019-10-19", "2019-10-20", "2019-11-14", "2019-11-15", "2019-12-16",
    "2019-09-11", "2019-09-12", "2019-09-26", "2019-09-27", "2019-09-28", "2019-09-29",
    "2019-09-30", "2019-10-01", "2019-10-02"
  )),
  a1c = c(5.5, 5.7, 5.5, 5.6, 5.5, 5.6, 5.5, 5.7, 5.5, 5.6, 5.5, 5.6,
          5.5, 5.7, 5.5, 5.6, 5.5, 5.7, 5.5, 5.6, 5.5)
)
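res <- df %>%
  arrange(person_id, date) %>%
  group_by(person_id) %>%
  mutate(
    # Each measurement date opens a candidate 7-day window
    window_start = date,
    window_end = date + days(6),
    # Number of this person's measurements inside that window
    measurements_in_window = sapply(date, function(ws) sum(date >= ws & date <= ws + days(6)))
  ) %>%
  # Keep only the earliest window with at least 5 measurements
  filter(measurements_in_window >= 5) %>%
  filter(date == min(date)) %>%
  ungroup() %>%
  # Pull every measurement of that person that falls inside the window
  inner_join(df, by = "person_id") %>%
  filter(date.y >= window_start & date.y <= window_end) %>%
  select(person_id, date = date.y, a1c = a1c.y)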
Output:
> res
# A tibble: 12 × 3
person_id date a1c
<dbl> <date> <dbl>
1 1000492 2019-10-15 5.5
2 1000492 2019-10-16 5.6
3 1000492 2019-10-18 5.5
4 1000492 2019-10-19 5.7
5 1000492 2019-10-20 5.5
6 1000499 2019-09-26 5.5
7 1000499 2019-09-27 5.6
8 1000499 2019-09-28 5.5
9 1000499 2019-09-29 5.7
10 1000499 2019-09-30 5.5
11 1000499 2019-10-01 5.6
12 1000499 2019-10-02 5.5
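If you need to reuse this with other thresholds, the same rolling-window idea can be sketched as a small base R function. This is only a sketch, not part of the answer above: `first_valid_window`, `min_days`, `window_days`, `demo_df` and `demo_res` are names made up for illustration. It also makes one assumption that differs from the pipeline above: it counts distinct upload days rather than rows, so two uploads on the same day count once. A person with no valid window simply contributes no rows.

```r
# Same sample data as in the question, under a different name
demo_df <- data.frame(
  person_id = c(rep(1000492, 12), rep(1000499, 9)),
  date = as.Date(c(
    "2019-09-11", "2019-09-22", "2019-09-23", "2019-09-24", "2019-10-15", "2019-10-16",
    "2019-10-18", "2019-10-19", "2019-10-20", "2019-11-14", "2019-11-15", "2019-12-16",
    "2019-09-11", "2019-09-12", "2019-09-26", "2019-09-27", "2019-09-28", "2019-09-29",
    "2019-09-30", "2019-10-01", "2019-10-02"
  )),
  a1c = c(5.5, 5.7, 5.5, 5.6, 5.5, 5.6, 5.5, 5.7, 5.5, 5.6, 5.5, 5.6,
          5.5, 5.7, 5.5, 5.6, 5.5, 5.7, 5.5, 5.6, 5.5)
)

# Hypothetical helper: rows in each person's first window of `window_days`
# consecutive days that contains at least `min_days` distinct upload days.
first_valid_window <- function(data, min_days = 5, window_days = 7) {
  pieces <- lapply(split(data, data$person_id), function(p) {
    p <- p[order(p$date), ]
    # For each measurement date, count distinct upload days in the window
    # that starts on that date.
    n_in_window <- vapply(seq_len(nrow(p)), function(i) {
      in_win <- p$date >= p$date[i] & p$date <= p$date[i] + (window_days - 1)
      length(unique(p$date[in_win]))
    }, integer(1))
    first <- which(n_in_window >= min_days)[1]
    if (is.na(first)) return(p[0, ])  # no valid window for this person
    start <- p$date[first]
    p[p$date >= start & p$date <= start + (window_days - 1), ]
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

demo_res <- first_valid_window(demo_df)
```

Note that a "week" here is any 7 consecutive days starting on a measurement date, as in the answer above, not a calendar week. With calendar weeks the result can differ, because a run of uploads may straddle two weeks.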