rdatetime-difference

Capture first observation every 30 days, drop those that are within the 30 day window


I am having unfortunate luck working on code to capture the first observation every 30 days from a previous observation. So, the 30-day window would reset upon a new 30+day observation from the last. This would be by a grouping ID. I can see that being difficult to understand so I wrote out an example dataset and a variable identifying what I would want to keep and delete. In addition, the code I am currently experimenting with.

df <- data.frame(id = c("a","a","a","a","a",'a',"b","b","b","b","b","b"),
                date = c('12/01/22','12/15/22','01/02/22','02/03/22','02/17/22','04/15/22',
                         '12/01/22','02/02/22','03/15/22','03/31/22','04/15/22','05/31/22'),
                keep = c('keep','delete','keep','keep','delete','keep',
                         'keep','keep','keep','delete','keep','keep'))
cutoff <- 30

df.t <- df %>% 
  # if date is < cutoff days of first date, maintain the same group
  # else create a new group
  group_by(g= accumulate(date, ~ if (.y - .x < cutoff) .x else .y)) %>%  
  # for each group select the first row
  slice_head(n = 1) %>% 
  # ungroup and remove grouping variable
  ungroup()                 

Updated with more relevant data for further assistance! Thank you!

structure(list(id_code = c("CH02-1", "CH02-10", "CH02-10", "CH02-10", 
"CH02-100", "CH02-100", "CH02-1000", "CH02-1001", "CH02-1002", 
"CH02-1002", "CH02-1002", "CH02-1002", "CH02-1002", "CH02-1002", 
"CH02-1003", "CH02-1004", "CH02-1004", "CH02-1005", "CH02-1006", 
"CH02-1007", "CH02-1007", "CH02-1007", "CH02-1008", "CH02-1009", 
"CH02-101", "CH02-101", "CH02-101", "CH02-101", "CH02-1010", 
"CH02-1010", "CH02-1011", "CH02-1011", "CH02-1012", "CH02-1012", 
"CH02-1013", "CH02-1014", "CH02-1015", "CH02-1016", "CH02-1017", 
"CH02-1017", "CH02-1018", "CH02-1019", "CH02-1019", "CH02-1019", 
"CH02-102", "CH02-102", "CH02-1020", "CH02-1021", "CH02-1022", 
"CH02-1022"), date = structure(c(18789, 19014, 19041, 19048, 
18950, 18961, 18786, 18767, 18950, 18951, 18952, 18970, 18995, 
18996, 18860, 19043, 19045, 18702, 18709, 18814, 18818, 18846, 
18795, 18926, 18789, 18793, 18810, 18820, 18697, 18702, 18748, 
18765, 18783, 18785, 18796, 18901, 18690, 18966, 18817, 18825, 
18950, 18695, 18701, 18708, 18987, 19017, 18782, 18992, 18724, 
18739), class = "Date"), row = 1:50), row.names = c(NA, -50L), class = c("tbl_df", 
"tbl", "data.frame"))

Solution

  • Perhaps others will, but I can't see a way around a loop. Essentially, the function my_f() loops through the dates and then finds ones that are within 30 days (ahead) and removes them. I used split() to make a list of data frames by id and then apply the function to each one separately. This would allow you to take advantage of multiple cores with parallel::mclapply() if you wanted to.

    library(dplyr)  
    df <- structure(list(id_code = c("CH02-1", "CH02-10", "CH02-10", "CH02-10", 
    "CH02-100", "CH02-100", "CH02-1000", "CH02-1001", "CH02-1002", 
    "CH02-1002", "CH02-1002", "CH02-1002", "CH02-1002", "CH02-1002", 
    "CH02-1003", "CH02-1004", "CH02-1004", "CH02-1005", "CH02-1006", 
    "CH02-1007", "CH02-1007", "CH02-1007", "CH02-1008", "CH02-1009", 
    "CH02-101", "CH02-101", "CH02-101", "CH02-101", "CH02-1010", 
    "CH02-1010", "CH02-1011", "CH02-1011", "CH02-1012", "CH02-1012", 
    "CH02-1013", "CH02-1014", "CH02-1015", "CH02-1016", "CH02-1017", 
    "CH02-1017", "CH02-1018", "CH02-1019", "CH02-1019", "CH02-1019", 
    "CH02-102", "CH02-102", "CH02-1020", "CH02-1021", "CH02-1022", 
    "CH02-1022"), date = structure(c(18789, 19014, 19041, 19048, 
    18950, 18961, 18786, 18767, 18950, 18951, 18952, 18970, 18995, 
    18996, 18860, 19043, 19045, 18702, 18709, 18814, 18818, 18846, 
    18795, 18926, 18789, 18793, 18810, 18820, 18697, 18702, 18748, 
    18765, 18783, 18785, 18796, 18901, 18690, 18966, 18817, 18825, 
    18950, 18695, 18701, 18708, 18987, 19017, 18782, 18992, 18724, 
    18739), class = "Date"), row = 1:50), row.names = c(NA, -50L), class = c("tbl_df", 
    "tbl", "data.frame"))
    df <- df %>% arrange(id_code, date)
    sp_df <- split(df, df$id_code)
    
    my_f <- function(x, cutoff=30){
      j <- 1
      while(j < nrow(x)){
        x <- x %>% filter((date - date[j]) <= 0 | (date - date[j]) > cutoff)
        j <- j+1
      }
      x
    }
    
    out <- bind_rows(
      lapply(sp_df, my_f))
    
    out
    #> # A tibble: 32 × 3
    #>    id_code   date         row
    #>    <chr>     <date>     <int>
    #>  1 CH02-1    2021-06-11     1
    #>  2 CH02-10   2022-01-22     2
    #>  3 CH02-10   2022-02-25     4
    #>  4 CH02-100  2021-11-19     5
    #>  5 CH02-1000 2021-06-08     7
    #>  6 CH02-1001 2021-05-20     8
    #>  7 CH02-1002 2021-11-19     9
    #>  8 CH02-1002 2022-01-03    13
    #>  9 CH02-1003 2021-08-21    15
    #> 10 CH02-1004 2022-02-20    16
    #> # ℹ 22 more rows
    

    Created on 2024-04-03 with reprex v2.0.2