I am having unfortunate luck working on code to capture the first observation every 30 days from a previous observation. So, the 30-day window would reset upon a new 30+day observation from the last. This would be by a grouping ID. I can see that being difficult to understand so I wrote out an example dataset and a variable identifying what I would want to keep and delete. In addition, the code I am currently experimenting with.
df <- data.frame(id = c("a","a","a","a","a",'a',"b","b","b","b","b","b"),
date = c('12/01/22','12/15/22','01/02/22','02/03/22','02/17/22','04/15/22',
'12/01/22','02/02/22','03/15/22','03/31/22','04/15/22','05/31/22'),
keep = c('keep','delete','keep','keep','delete','keep',
'keep','keep','keep','delete','keep','keep'))
cutoff <- 30
df.t <- df %>%
# if date is < cutoff days of first date, maintain the same group
# else create a new group
group_by(g= accumulate(date, ~ if (.y - .x < cutoff) .x else .y)) %>%
# for each group select the first row
slice_head(n = 1) %>%
# ungroup and remove grouping variable
ungroup()
Updated with more relevant data for further assistance! Thank you!
structure(list(id_code = c("CH02-1", "CH02-10", "CH02-10", "CH02-10",
"CH02-100", "CH02-100", "CH02-1000", "CH02-1001", "CH02-1002",
"CH02-1002", "CH02-1002", "CH02-1002", "CH02-1002", "CH02-1002",
"CH02-1003", "CH02-1004", "CH02-1004", "CH02-1005", "CH02-1006",
"CH02-1007", "CH02-1007", "CH02-1007", "CH02-1008", "CH02-1009",
"CH02-101", "CH02-101", "CH02-101", "CH02-101", "CH02-1010",
"CH02-1010", "CH02-1011", "CH02-1011", "CH02-1012", "CH02-1012",
"CH02-1013", "CH02-1014", "CH02-1015", "CH02-1016", "CH02-1017",
"CH02-1017", "CH02-1018", "CH02-1019", "CH02-1019", "CH02-1019",
"CH02-102", "CH02-102", "CH02-1020", "CH02-1021", "CH02-1022",
"CH02-1022"), date = structure(c(18789, 19014, 19041, 19048,
18950, 18961, 18786, 18767, 18950, 18951, 18952, 18970, 18995,
18996, 18860, 19043, 19045, 18702, 18709, 18814, 18818, 18846,
18795, 18926, 18789, 18793, 18810, 18820, 18697, 18702, 18748,
18765, 18783, 18785, 18796, 18901, 18690, 18966, 18817, 18825,
18950, 18695, 18701, 18708, 18987, 19017, 18782, 18992, 18724,
18739), class = "Date"), row = 1:50), row.names = c(NA, -50L), class = c("tbl_df",
"tbl", "data.frame"))
Perhaps others will, but I can't see a way around a loop. Essentially, the function my_f()
loops through the dates and then finds ones that are within 30 days (ahead) and removes them. I used split()
to make a list of data frames by id
and then apply the function to each one separately. This would allow you to take advantage of multiple cores with parallel::mclapply()
if you wanted to.
library(dplyr)
df <- structure(list(id_code = c("CH02-1", "CH02-10", "CH02-10", "CH02-10",
"CH02-100", "CH02-100", "CH02-1000", "CH02-1001", "CH02-1002",
"CH02-1002", "CH02-1002", "CH02-1002", "CH02-1002", "CH02-1002",
"CH02-1003", "CH02-1004", "CH02-1004", "CH02-1005", "CH02-1006",
"CH02-1007", "CH02-1007", "CH02-1007", "CH02-1008", "CH02-1009",
"CH02-101", "CH02-101", "CH02-101", "CH02-101", "CH02-1010",
"CH02-1010", "CH02-1011", "CH02-1011", "CH02-1012", "CH02-1012",
"CH02-1013", "CH02-1014", "CH02-1015", "CH02-1016", "CH02-1017",
"CH02-1017", "CH02-1018", "CH02-1019", "CH02-1019", "CH02-1019",
"CH02-102", "CH02-102", "CH02-1020", "CH02-1021", "CH02-1022",
"CH02-1022"), date = structure(c(18789, 19014, 19041, 19048,
18950, 18961, 18786, 18767, 18950, 18951, 18952, 18970, 18995,
18996, 18860, 19043, 19045, 18702, 18709, 18814, 18818, 18846,
18795, 18926, 18789, 18793, 18810, 18820, 18697, 18702, 18748,
18765, 18783, 18785, 18796, 18901, 18690, 18966, 18817, 18825,
18950, 18695, 18701, 18708, 18987, 19017, 18782, 18992, 18724,
18739), class = "Date"), row = 1:50), row.names = c(NA, -50L), class = c("tbl_df",
"tbl", "data.frame"))
df <- df %>% arrange(id_code, date)
sp_df <- split(df, df$id_code)
my_f <- function(x, cutoff=30){
j <- 1
while(j < nrow(x)){
x <- x %>% filter((date - date[j]) <= 0 | (date - date[j]) > cutoff)
j <- j+1
}
x
}
out <- bind_rows(
lapply(sp_df, my_f))
out
#> # A tibble: 32 × 3
#> id_code date row
#> <chr> <date> <int>
#> 1 CH02-1 2021-06-11 1
#> 2 CH02-10 2022-01-22 2
#> 3 CH02-10 2022-02-25 4
#> 4 CH02-100 2021-11-19 5
#> 5 CH02-1000 2021-06-08 7
#> 6 CH02-1001 2021-05-20 8
#> 7 CH02-1002 2021-11-19 9
#> 8 CH02-1002 2022-01-03 13
#> 9 CH02-1003 2021-08-21 15
#> 10 CH02-1004 2022-02-20 16
#> # ℹ 22 more rows
Created on 2024-04-03 with reprex v2.0.2