rdplyrcase

Selecting grouped samples satisfying desired conditions


I have a dataframe like

raw.df <- data.frame(id = c("X01", "X02", "X03", "X04", "X05", "X06", "X07", "X08", "X09", "X10"),
           subject = c("S01", "S01", "S01", "S02", "S02", "S03", "S04", "S04", "S05", "S06"),
           time = c("D0", "D1", "D2", "D0", "D2", "D0", "D0", "D2", "D2", "D2"),
           response = c("Y", "Y", "Y", "N", "N", "Y", "Y", "Y", "Y", "N"))

I want to create a new column to add "Y" for the samples satisfying conditions, and "N" for the others. If any subject with response="Y" has two samples from times D0 and D2, they should be selected. Based on that, I need to obtain the dataframe below with "selected" column.

final.df <- data.frame(id = c("X01", "X02", "X03", "X04", "X05", "X06", "X07", "X08", "X09", "X10"),
           subject = c("S01", "S01", "S01", "S02", "S02", "S03", "S04", "S04", "S05", "S06"),
           time = c("D0", "D1", "D2", "D0", "D2", "D0", "D0", "D2", "D2", "D2"),
           response = c("Y", "Y", "Y", "N", "N", "Y", "Y", "Y", "Y", "N"),
           selected = c("Y", "N", "Y", "N", "N", "N", "Y", "Y", "N", "N"))

I can select samples without pairing using case_when by;

final.df <- raw.df %>%
  mutate(selected = case_when(time %in% c("D0", "D2") & response = "Y" ~ "Y",
                              TRUE ~ "N"))

for pairing, I can use another code

final.df %>%
  filter(selected == "Y") %>%
  group_by(subject) %>%
  add_count() %>% ungroup() %>%
  mutate(n = if_else(n == 2, "Include", "Exclude")) %>%
  dplyr::rename(`paired` = n) 

and then I can combine those, but I wonder if there is a better combined solution


Solution

  • One way could be to create a table of matching cases and join to it. The matching cases where are where time is D0 or D1, response is Y, and there are two distinct times per subject -- ie at least one Y for each of those two times. Then I put in N where the join doesn't have a match.

    raw.df |>
      left_join(
        raw.df |>
          filter(time %in% c("D0", "D2"), response == "Y") |>
          distinct(time, subject) |>
          filter(n() == 2, .by = subject) |>
          mutate(selected = "Y")) |>
      mutate(selected = coalesce(selected, "N"))
    

    Result

        id subject time response selected
    1  X01     S01   D0        Y        Y
    2  X02     S01   D1        Y        N
    3  X03     S01   D2        Y        Y
    4  X04     S02   D0        N        N
    5  X05     S02   D2        N        N
    6  X06     S03   D0        Y        N
    7  X07     S04   D0        Y        Y
    8  X08     S04   D2        Y        Y
    9  X09     S05   D2        Y        N
    10 X10     S06   D2        N        N